Abstract

We consider the problem of answering queries about formulas of propositional logic based 
on background knowledge partially represented explicitly as other formulas, and partially repre- 
sented as partially obscured examples independently drawn from a fixed probability distribution, 
where the queries are answered with respect to a weaker semantics than usual (PAC-Semantics,
introduced by Valiant [51]) that is defined using the distribution of examples. We describe a
fairly general, efficient reduction to limited versions of the decision problem for a proof system 
(e.g., bounded space treelike resolution, bounded degree polynomial calculus, etc.) from cor- 
responding versions of the reasoning problem where some of the background knowledge is not 
explicitly given as formulas, only learnable from the examples. Crucially, we do not generate an 
explicit representation of the knowledge extracted from the examples, and so the "learning" of 
the background knowledge is only done implicitly. As a consequence, this approach can utilize 
formulas as background knowledge that are not perfectly valid over the distribution — essentially 
the analogue of agnostic learning here. 
1 Introduction



PAC-Semantics was introduced by Valiant [51] in an attempt to unify statistical and logical
approaches to reasoning: on the one hand, given background knowledge represented as a collection of
axioms, one may perform logical deduction, and on the other hand, given background knowledge 
represented as a collection of examples, one can derive a statistical conclusion by testing whether 
the conclusion is supported by a sufficiently large fraction of the examples. PAC-Semantics captures 
both sources. As is typical for such works, we can illustrate the utility of such a combined approach 
with a story about an aviary. Suppose that we know that the birds of the aviary fly unless they 
are penguins, and that penguins eat fish. Now, suppose that we visit the aviary at feeding time, 
and notice that most (but perhaps not all) of the birds in the aviary seem not to eat fish. From 
this information, we can infer that most of the birds in the aviary can fly. This conclusion draws 
on both the empirical (partial) information and reasoning from our explicit, factual knowledge: on 
the one hand, our empirical observations did not mention anything about whether or not the birds 
of the aviary could fly, and on the other hand, although our knowledge is sufficient to conclude that 
the birds that don't eat fish can fly, it isn't sufficient to conclude whether or not, broadly speaking, 
the birds in the aviary can fly. 

Valiant's original work described an application of PAC-Semantics to the task of predicting the 
values of unknown attributes in new examples based on the values of some known attributes of the 
example — for example, filling in a missing word in an example sentence [41]. In this work, by con- 
trast, we introduce and describe how to solve a (limited) decision task for PAC-Semantics, deciding 
whether or not a given "query" formula follows from the background knowledge, represented by 
both a collection of axiom formulas and a collection of examples. In particular, we use a model of 
partial information due to Michael [40] to capture and cope with reasoning from partially obscured 
examples from a target distribution. 

What we show is roughly that as long as we can efficiently use small proofs to certify validity 
in the classical sense and the rules of inference in the proof system are preserved under restric- 
tions, we can efficiently certify the validity (under PAC-Semantics) of a query from a sample of 
partial assignments whenever it follows from some formula(s) that could be verified to hold under 
the partial assignments. Thus, in such a case, the introduction of probability to the semantics 
in this limited way (to cope with the imperfection of learned rules) actually does not harm the 
tractability of inference. Moreover, the "learning" is actually also quite efficient, and imposes no 
restrictions on the representation class beyond the assumption that their values are observed under 
the partial assignments and the restrictions imposed by the proof system itself. In Section 4 we
will then observe that almost every special case of a propositional proof system with an efficient 
decision algorithm considered in the literature satisfies these conditions, establishing the breadth 
of applicability of the approach. 

It is perhaps more remarkable from a learning-theoretic perspective that our approach does
not require the rules to be learned (or discovered) to be completely consistent with the examples
drawn from the (arbitrary) distribution. In the usual learning context, this would be referred to
as agnostic learning, as introduced by Kearns et al. [28]. Agnostic learning is notoriously hard:
Kearns et al. noted that agnostic learning of conjunctions (over an arbitrary distribution, in the 
standard PAC-learning sense) would yield an efficient algorithm for PAC-learning DNF (also over 
arbitrary distributions), which remains the central open problem of computational learning theory. 
Again, by declining to produce a hypothesis, we manage to circumvent a barrier (to the state of the 
art, at least). Such rules of less-than-perfect validity seem to be very useful from the perspective of 
AI: for example, logical encodings of planning problems typically use "frame axioms" that assert
that nothing changes unless it is the effect of an action. In a real world setting, these axioms are 
not strictly true, but such rules still provide a useful approximation. It is therefore desirable that 
we can learn to utilize them. We discuss this further in Section 5.

Relationship to other work. Given that the task we consider is fundamental and has a variety
of applications, other approaches have naturally been proposed: for example, Markov Logic [47]
is one well-known approach based on graphical models, and Bayesian Logic Programming [29] is 
an approach that has grown out of the Inductive Logic Programming (ILP) community that can 
address the kinds of tasks we consider here. The main distinction between all of these approaches 
and our approach is that these other approaches all aim to model the distribution of the data, which 
is generally a much more demanding task - both in terms of the amount of data and computation 
time required - than simply answering a query. Naturally, the upshot of these other works is 
that they are much more versatile, and there are a variety of other tasks (e.g., density estimation, 
maximum likelihood computations) that these frameworks can handle that we do not. Our aim is 
instead to show how this more limited (but still useful) task can be done much more efficiently, 
much like how algorithms such as SVMs and boosting can succeed at predicting attributes without 
needing to model the distribution of the data. 

In this respect, our work is similar to the Learning to Reason framework of Khardon and 
Roth [30], who showed how an NP-hard reasoning task (deciding a log n-CNF query), when coupled
with a learning task beyond the reach of the state of the art (learning DNF from random examples) 
could result in an efficient overall system. The distinction between our work and Khardon and 
Roth's is, broadly speaking, that we re-introduce the theorem-proving aspect that Khardon and 
Roth had explicitly sought to avoid. Briefly, these techniques permit us to incorporate declaratively 
specified background knowledge and moreover, permit us to cope with partial information in more 
general cases than Khardon and Roth [31], who could only handle constant width clauses. Another
difference between our work and that of Khardon and Roth, that also distinguishes our work from
traditional ILP (e.g., [12]), is that as mentioned above, we are able to utilize rules that hold with
less than perfect probability (akin to agnostic learning, but easier to achieve here). 

2 Definitions and preliminaries 

PAC-Semantics. Inductive generalization (as opposed to deduction) inherently entails the
possibility of making mistakes. Thus, the kind of rules produced by learning algorithms cannot hope
to be valid in the traditional (Tarskian) sense (for reasons we describe momentarily), but
intuitively they do capture some useful quality. PAC-Semantics were thus introduced by Valiant [51]
to capture the quality possessed by the output of PAC-learning algorithms when formulated in a
logic. Precisely, suppose that we observe examples independently drawn from a distribution over
$\{0,1\}^n$; now, suppose that our algorithm has found a rule $f(x)$ for predicting some target attribute
$x_t$ from the other attributes. The formula "$x_t = f(x)$" may not be valid in the traditional sense, as
PAC-learning does not guarantee that the rule holds for every possible binding, only that the rule
$f$ so produced agrees with $x_t$ with probability $1 - \epsilon$ with respect to future examples drawn from
the same distribution. That is, the formula is instead "valid" in the following sense:

Definition 1 ($(1-\epsilon)$-valid) Given a distribution D over $\{0,1\}^n$, we say that a Boolean function
R is $(1-\epsilon)$-valid if $\Pr_{x \in D}[R(x) = 1] \ge 1 - \epsilon$. If $\epsilon = 0$, we say R is perfectly valid.






Of course, we may consider $(1-\epsilon)$-validity of relations R that are not obtained by learning
algorithms and in particular, not of the form "$x_t = f(x)$."

Classical inference in PAC-Semantics. Valiant [51] considered one rule of inference, chaining,
for formulas of the form $x_t = f(x)$ where $f$ is a linear threshold function: given a collection of literals
such that the partial assignment obtained from satisfying those literals guarantees $f$ evaluates to
true, infer the literal $x_t$. Valiant observed that for such learned formulas, the conjunction of literals
derived from a sequence of applications of chaining is also $(1-\epsilon')$-valid for some polynomially larger
$\epsilon'$. It turns out that this property of soundness under PAC-Semantics is not a special feature of
chaining: generally, it follows from the union bound that any classically sound derivation is also
sound under PAC-Semantics in a similar sense.

Proposition 2 (Classical reasoning is usable in PAC-Semantics) Let $\psi_1, \ldots, \psi_k$ be
formulas such that each $\psi_i$ is $(1-\epsilon_i)$-valid under a common distribution D for some $\epsilon_i \in [0,1]$. Suppose
that $\{\psi_1, \ldots, \psi_k\} \models \varphi$ (in the classical sense). Then $\varphi$ is $(1-\epsilon')$-valid under D for $\epsilon' = \sum_i \epsilon_i$.

So, soundness under PAC-Semantics does not pose any constraints on the rules of inference that 
we might consider; the degree of validity of the conclusions merely aggregates any imperfections in 
the various individual premises involved. We also note that without further knowledge of D, the 
loss of validity from the use of a union bound is optimal. 
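
The union-bound step behind Proposition 2 can be written out in one line. This is a short sketch in the notation of the proposition; the only fact used is that any assignment falsifying $\varphi$ must falsify at least one premise $\psi_i$, since the premises classically entail $\varphi$.

```latex
\Pr_{x \in D}[\varphi(x) = 0]
  \;\le\; \Pr_{x \in D}\bigl[\exists i :\ \psi_i(x) = 0\bigr]
  \;\le\; \sum_{i=1}^{k} \Pr_{x \in D}[\psi_i(x) = 0]
  \;\le\; \sum_{i=1}^{k} \epsilon_i .
```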

Proposition 3 (Optimality of the union bound for classical reasoning) Let $\psi_1, \ldots, \psi_k$ be
a collection of formulas such that there exists some distribution D on which each $\psi_i$ is $(1-\epsilon_i)$-valid,
for which $\{\psi_1, \ldots, \psi_{i-1}, \psi_{i+1}, \ldots, \psi_k\} \not\models \psi_i$, and $\sum_i \epsilon_i < 1$. Then there exists a distribution $D'$ for
which each $\psi_i$ is $(1-\epsilon_i)$-valid, but $\psi_1 \land \cdots \land \psi_k$ is not $(1 - \sum_i \epsilon_i + \delta)$-valid for any $\delta > 0$.

Proof: Since Proposition 2 guarantees that $\psi_1 \land \cdots \land \psi_k$ is at least $(1 - \sum_i \epsilon_i)$-valid where
$1 - \sum_i \epsilon_i > 0$, there must be a (satisfying) assignment $x^{(0)}$ for $\psi_1 \land \cdots \land \psi_k$. On the other hand,
as each $\psi_i$ is not entailed by the others, there must be some assignment $x^{(i)}$ that satisfies the
others but falsifies $\psi_i$. We now construct $D'$: it places weight $\epsilon_i$ on the assignment $x^{(i)}$, and weight
$1 - \sum_i \epsilon_i$ on $x^{(0)}$. It is easy to verify that $D'$ satisfies the claimed conditions. □

Subsequently, we will assume that our Boolean functions will be given by formulas of
propositional logic formed over Boolean variables $\{x_1, \ldots, x_n\}$ by negation and the following linear
threshold connectives (which we will refer to as the threshold basis for propositional formulas):

Definition 4 (Threshold connective) A threshold connective for a list of $k$ formulas $\varphi_1, \ldots, \varphi_k$
is given by a list of $k+1$ real numbers, $c_1, \ldots, c_k, b$. The formula $[\sum_{i=1}^{k} c_i \varphi_i \ge b]$ is interpreted as
follows: given a Boolean interpretation for the $k$ formulas, the connective is true if $\sum_{i : \varphi_i = 1} c_i \ge b$.

Naturally, a threshold connective expresses a $k$-ary AND connective by taking $c_i = 1$ and
$b = k$, and expresses a $k$-ary OR by taking $c_1, \ldots, c_k, b = 1$.
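
The threshold connective and its AND/OR special cases can be illustrated with a small sketch. The function names below are ours, chosen for illustration; the evaluation rule is the one stated in Definition 4 (sum the coefficients of the true subformulas and compare with b).

```python
from typing import Sequence

def threshold(coeffs: Sequence[float], b: float, values: Sequence[bool]) -> bool:
    """Evaluate [sum_i c_i * phi_i >= b] given truth values of the phi_i."""
    return sum(c for c, v in zip(coeffs, values) if v) >= b

def k_and(values: Sequence[bool]) -> bool:
    # k-ary AND: all coefficients 1 and b = k.
    return threshold([1] * len(values), len(values), values)

def k_or(values: Sequence[bool]) -> bool:
    # k-ary OR: all coefficients 1 and b = 1.
    return threshold([1] * len(values), 1, values)

print(k_and([True, True, False]), k_or([True, True, False]))
```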

We note that Valiant actually defines PAC-Semantics for first-order logic by considering D to 
be a distribution over the values of atomic formulas. He focuses on formulas of bounded arity 
over a polynomial size domain; then evaluating such formulas from the (polynomial size) list of 
values of all atomic formulas is tractable, and in such a case everything we consider here about 
propositional logic essentially carries over in the usual way, by considering each atomic formula 
to be a propositional variable (and rewriting the quantifiers as disjunctions or conjunctions over 
all bindings). As we don't have any insights particular to first-order logic to offer, we will focus 
exclusively on the propositional case in this work. 
Partial observability. Our knowledge of a distribution D will be provided in the form of a
collection of examples independently drawn from D, and our main question of interest will be
deciding whether or not a formula is $(1-\epsilon)$-valid. Of course, reasoning in PAC-Semantics from
(complete) examples is trivial: Hoeffding's inequality guarantees that with high probability, the
proportion of times that the query formula evaluates to 'true' is a good estimate of the degree of
validity of the formula. By contrast, if the distribution D is not known and no examples are available,
then we can't guarantee that a formula is $(1-\epsilon)$-valid for any $\epsilon < 1$ without deciding whether the query
is a tautology. So, it is only interesting to consider what happens "in between." To capture such
"in between" situations, we will build on the theory of learning from partial observations developed 
by Michael [40].
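
As a minimal sketch of the trivial complete-information case mentioned above: with fully observed examples, the degree of validity can be estimated by an empirical average. The function names and the example query are ours, and the sample-size formula assumes the standard two-sided form of Hoeffding's inequality for {0,1}-valued variables.

```python
import math

def samples_needed(gamma: float, delta: float) -> int:
    """Samples sufficient for an empirical mean of {0,1}-valued draws to be
    within gamma of its expectation with probability >= 1 - delta
    (two-sided Hoeffding bound: 2 * exp(-2 * m * gamma**2) <= delta)."""
    return math.ceil(math.log(2 / delta) / (2 * gamma ** 2))

def estimate_validity(formula, examples) -> float:
    """Fraction of complete examples on which the formula evaluates to true."""
    return sum(1 for x in examples if formula(x)) / len(examples)

# Hypothetical query "x0 or x1", evaluated on four complete assignments.
query = lambda x: bool(x[0] or x[1])
examples = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(estimate_validity(query, examples))
print(samples_needed(gamma=0.1, delta=0.05))
```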

Definition 5 (Partial assignments) A partial assignment $\rho$ is an element of $\{0,1,*\}^n$. We say
that a partial assignment $\rho$ is consistent with an assignment $x \in \{0,1\}^n$ if whenever $\rho_i \ne *$, $\rho_i = x_i$.

Naturally, instead of examples from D, our knowledge of D will be provided in the form of a 
collection of example partial assignments drawn from a masking process over D: 

Definition 6 (Masking process) A mask is a function $m : \{0,1\}^n \to \{0,1,*\}^n$, with the
property that for any $x \in \{0,1\}^n$, $m(x)$ is consistent with $x$. A masking process M is a mask-valued
random variable (i.e., a random function). We denote the distribution over partial assignments
obtained by applying a masking process M to a distribution D over assignments by $M(D)$.
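
To make Definitions 5 and 6 concrete, here is a small sketch of one particularly simple masking process, which hides each coordinate independently. This is only an illustrative assumption: the definition also allows masks whose hidden entries depend on the underlying example. The names `STAR`, `consistent`, and `independent_mask` are ours.

```python
import random

STAR = "*"  # marker for a hidden (masked) coordinate

def consistent(rho, x) -> bool:
    """True iff rho agrees with x on every coordinate that rho does not hide."""
    return all(r == STAR or r == xi for r, xi in zip(rho, x))

def independent_mask(x, hide_prob: float, rng: random.Random):
    """Hide each coordinate of x independently with probability hide_prob."""
    return tuple(STAR if rng.random() < hide_prob else xi for xi in x)

rng = random.Random(0)
x = (1, 0, 1, 1)
rho = independent_mask(x, hide_prob=0.5, rng=rng)
print(rho, consistent(rho, x))
```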

Note that the definition of masking processes allows the hiding of entries to depend on the 
underlying example from D. Of course, since we know that when all entries are hidden by a 
masking process the problem we consider will become NP-hard, we must restrict our attention 
to settings where it is possible to learn something about D. In pursuit of this, we will consider 
formulas that can be evaluated in the straightforward way from the partial assignments with high 
probability — such formulas are one kind which we can certainly say that we know to be (essentially) 
true under D. 

Definition 7 (Witnessed formulas) We define a formula to be witnessed to evaluate to true or 
false in a partial assignment by induction on its construction; we say that the formula is witnessed 
iff it is witnessed to evaluate to either true or false. 

• A variable is witnessed to be true or false iff it is respectively true or false in the partial 
assignment. 

• $\neg\varphi$ is witnessed to evaluate to true iff $\varphi$ is witnessed to evaluate to false; naturally, $\neg\varphi$ is
witnessed to evaluate to false iff $\varphi$ is witnessed to evaluate to true.

• A formula with a threshold connective $[c_1\varphi_1 + \cdots + c_k\varphi_k \ge b]$ is witnessed to evaluate to true iff
$\sum_{i : \varphi_i \text{ witnessed true}} c_i + \sum_{i : \varphi_i \text{ not witnessed}} \min\{0, c_i\} \ge b$, and it is witnessed to evaluate to false
iff $\sum_{i : \varphi_i \text{ witnessed true}} c_i + \sum_{i : \varphi_i \text{ not witnessed}} \max\{0, c_i\} < b$ (i.e., iff the truth or falsehood,
respectively, of the inequality is determined by the witnessed formulas, regardless of what
values are substituted for the non-witnessed formulas).

An example of particular interest is a CNF formula. A CNF is witnessed to evaluate to true in 
a partial assignment precisely when every clause has some literal that is satisfied. It is witnessed 
to evaluate to false precisely when there is some clause in which every literal is falsified. 
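
The inductive definition of witnessing can be sketched as a short recursive evaluator. The tuple encoding of formulas and the helper names `OR` and `AND` are assumptions made for this illustration; the three cases follow Definition 7, with `None` standing for "not witnessed".

```python
STAR = "*"

# Formulas as nested tuples (a representation chosen for this sketch):
#   ("var", i)                       the variable x_i (0-indexed)
#   ("not", f)                       negation
#   ("thr", coeffs, subformulas, b)  the connective [sum_i c_i f_i >= b]

def witnessed(f, rho):
    """Return True/False if f is witnessed to evaluate to that value in rho, else None."""
    kind = f[0]
    if kind == "var":
        v = rho[f[1]]
        return None if v == STAR else bool(v)
    if kind == "not":
        w = witnessed(f[1], rho)
        return None if w is None else not w
    _, coeffs, subs, b = f
    vals = [witnessed(s, rho) for s in subs]
    base = sum(c for c, v in zip(coeffs, vals) if v is True)
    unknown = [c for c, v in zip(coeffs, vals) if v is None]
    if base + sum(min(0, c) for c in unknown) >= b:
        return True   # the inequality holds however the unknowns are set
    if base + sum(max(0, c) for c in unknown) < b:
        return False  # the inequality fails however the unknowns are set
    return None

def OR(*fs):
    return ("thr", [1] * len(fs), list(fs), 1)

def AND(*fs):
    return ("thr", [1] * len(fs), list(fs), len(fs))

x0, x1, x2 = ("var", 0), ("var", 1), ("var", 2)
cnf = AND(OR(x0, x1), OR(("not", x0), x2))  # (x0 or x1) and (not x0 or x2)
print(witnessed(cnf, (1, STAR, 1)), witnessed(cnf, (1, STAR, STAR)))
```

On the CNF example, the formula is witnessed true exactly when each clause has a satisfied literal, matching the remark above.
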
Refining the motivating initial discussion somewhat, a witnessed formula is one that can be
evaluated in a very local manner. When the formula is not witnessed, we will likewise be interested 
in the following "simplification" of the formula obtained from an incomplete evaluation: 

Definition 8 (Restricted formula) Given a partial assignment $\rho$ and a formula $\varphi$, the
restriction of $\varphi$ under $\rho$, denoted $\varphi|_\rho$, is recursively defined as follows:

• If $\varphi$ is witnessed in $\rho$, then $\varphi|_\rho$ is the formula representing the value that $\varphi$ is witnessed to
evaluate to under $\rho$.

• If $\varphi$ is a variable not set by $\rho$, $\varphi|_\rho = \varphi$.

• If $\varphi = \neg\psi$ and $\varphi$ is not witnessed in $\rho$, then $\varphi|_\rho = \neg(\psi|_\rho)$.

• If $\varphi = [\sum_{i=1}^{k} c_i \psi_i \ge b]$ and $\varphi$ is not witnessed in $\rho$, suppose that $\psi_1, \ldots, \psi_\ell$ are witnessed
in $\rho$ (and $\psi_{\ell+1}, \ldots, \psi_k$ are not witnessed). Then $\varphi|_\rho$ is $[\sum_{i=\ell+1}^{k} c_i (\psi_i|_\rho) \ge d]$ where
$d = b - \sum_{i \le \ell : \psi_i|_\rho = 1} c_i$.

For a restriction $\rho$ and set of formulas $F$, we let $F|_\rho$ denote the set $\{\varphi|_\rho : \varphi \in F\}$.
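
For the special case of a clause (an OR of literals, i.e., a threshold connective with all coefficients 1 and threshold 1), the restriction operation reduces to a familiar simplification, sketched below. The integer encoding of literals and the names `restrict_clause`, `restrict_set`, and `TRUE` are assumptions of this sketch, not notation from the text.

```python
STAR = "*"
TRUE = "TRUE"  # stands for the constant formula 1 ("true")

def restrict_clause(clause, rho):
    """Restriction of a clause under the partial assignment rho.

    A clause is a frozenset of nonzero ints: literal v means x_v and -v means
    NOT x_v, with variables 1-indexed into rho."""
    remaining = set()
    for lit in clause:
        value = rho[abs(lit) - 1]
        if value == STAR:
            remaining.add(lit)       # an unwitnessed literal is kept
        elif (value == 1) == (lit > 0):
            return TRUE              # a satisfied literal witnesses the clause true
        # a falsified literal is simply dropped
    return frozenset(remaining)      # the empty clause means "witnessed false"

def restrict_set(clauses, rho):
    """The set F restricted by rho: restrict every clause in F."""
    return {restrict_clause(c, rho) for c in clauses}

print(restrict_clause(frozenset({1, -2, 3}), (0, STAR, STAR)))
```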

Proof systems. We will need a formalization of a "proof system" in order to state our theorems: 

Definition 9 (Proof system) A proof system is given by a sequence of relations $\{R_i\}_{i=0}^{\infty}$ over
formulas such that $R_i$ is of arity $(i+1)$ and whenever $R_i(\psi_{j_1}, \ldots, \psi_{j_i}, \varphi)$ holds, $\{\psi_{j_1}, \ldots, \psi_{j_i}\} \models \varphi$.
Any formula $\varphi$ satisfying $R_0$ is said to be an axiom of the proof system. A proof of a formula $\varphi$
from a set of hypotheses H in the proof system is given by a finite sequence of triples consisting of

1. A formula $\psi_k$

2. A relation $R_i$ of the proof system or the set H

3. A subsequence of formulas $\psi_{j_1}, \ldots, \psi_{j_i}$ with $j_l < k$ for $l = 1, \ldots, i$ (i.e., from the first
components of earlier triples in the sequence) such that $R_i(\psi_{j_1}, \ldots, \psi_{j_i}, \psi_k)$ holds, unless $\psi_k \in H$.

for which $\varphi$ is the first component of the final triple in the sequence.

Needless to say it is generally expected that Ri is somehow efficiently computable, so that the 
proofs can be checked. We don't explicitly impose such a constraint on the formal object for the 
sake of simplicity, but the reader should be aware that these expectations will be fulfilled in all 
cases of interest. 

We will be interested in the effect of the restriction (partial evaluation) mapping applied to 
proofs — that is, the "projection" of a proof in the original logic down to a proof over the smaller 
set of variables by the application of the restriction to every step in the proof. Although it may be 
shown that this at least preserves the (classical) semantic soundness of the steps, this falls short 
of what we require: we need to know that the rules of inference are preserved under restrictions. 
Since the relations defining the proof system are arbitrary, though, this property must be explicitly 
verified. Formally, then: 

Definition 10 (Restriction-closed proof system) We will say that a proof system over
propositional formulas is restriction closed if for every proof of the proof system and every partial
assignment $\rho$, for any (satisfactory) step of the proof $R_k(\psi_1, \ldots, \psi_k, \varphi)$, there is some $j \le k$ such that
for the subsequence $\psi_{i_1}, \ldots, \psi_{i_j}$, $R_j(\psi_{i_1}|_\rho, \ldots, \psi_{i_j}|_\rho, \varphi|_\rho)$ is satisfied, and the formula 1 ("true") is
an axiom.

This last condition is a technical condition that usually requires a trivial modification of any proof system to 
accommodate. We can usually do without this condition in actuality, but the details depend on the proof system. 
So, when a proof system is restriction-closed, given a derivation of a formula $\varphi$ from $\psi_1, \ldots, \psi_k$,
we can extract a derivation of $\varphi|_\rho$ from $\psi_1|_\rho, \ldots, \psi_k|_\rho$ for any partial assignment $\rho$ such that the
steps of the proof consist of formulas mentioning only the variables masked in $\rho$. (In particular, we
could think of this as a proof in a proof system for a logic with variables $\{x_i : \rho_i = *\}$.) In a sense,
this means that we can extract a proof of a "special case" from a more general proof by applying
the restriction operator to every formula in the proof. Again, looking ahead to Section 4, we will
see that the typical examples of propositional proof systems that have been considered essentially
have this property.

We will be especially interested in limited versions of the decision problem for a logic given by a 
collection of "simple" proofs — if the proofs are sufficiently restricted, it is possible to give efficient 
algorithms to search for such proofs, and then such a limited version of the decision problem will 
be tractable, in contrast to the general case. Formally, now: 

Definition 11 (Limited decision problem) Fix a proof system, and let S be a set of proofs in 
the proof system. The limited decision problem for S is then the following promise problem: given 
as input a formula $\varphi$ with no free variables and a set of hypotheses H such that either there is a
proof of $\varphi$ in S from H or else $H \not\models \varphi$, decide which case holds.

A classic example of such a limited decision problem for which efficient algorithms exist is for 
formulas of propositional logic that have "treelike" resolution derivations of constant width (cf. 
the work of Ben-Sasson and Wigderson [7] or the work of Beame and Pitassi [6], building on work 
by Clegg et al. [11]). We will actually return to this example in more detail in Section 4, but we
mention it now for the sake of concreteness. 

We will thus be interested in syntactic restrictions of restriction-closed proof systems. We wish 
to know that (in contrast to the rules of the proof system) these syntactic restrictions are likewise 
closed under restrictions in the following sense: 

Definition 12 (Restriction-closed set of proofs) A set of proofs S is said to be restriction
closed if whenever there is a proof of a formula $\varphi$ from a set of hypotheses H in S, there is also a
proof of $\varphi|_\rho$ from the set $H|_\rho$ in S for any partial assignment $\rho$.

3 Inferences from incomplete data with implicit learning 

A well-known general phenomenon in learning theory is that a restrictive choice of representation 
for hypotheses often imposes artificial computational difficulties. Since fitting a hypothesis is often 
a source of intractability, it is natural to suspect that one would often be able to achieve more if the 
need for such an explicit hypothesis were circumvented — that is, if "learning" were integrated more 
tightly into the application using the knowledge extracted from data. For the application of
answering queries, this insight was pursued by Khardon and Roth [30] in the learning to reason framework,
where queries against an unknown DNF could be answered using examples. The trivial algorithm 
that evaluates formulas on complete assignments and uses the fraction satisfied to estimate the 
validity suggests how this might happen: the examples themselves encode the needed information 
and so it is easier to answer the queries using the examples directly. In this case, the knowledge is 
used implicitly: the existence of the DNF describing the support of the distribution (thus, governing 
which models need to be considered) guarantees that the behavior of the algorithm is correct, but 
at no point does the algorithm "discover" the representation of such a DNF. Effectively, we will
develop an alternative approach that incorporates reasoning to cope with incomplete examples and
explicit background knowledge, and yet retains the appealing circumvention of the construction of
explicit representations for learned knowledge. In this approach, there are "axioms" that can be
extracted from the observable data, which, we suppose, if known could be combined with the
background knowledge to answer a given query.

Algorithm 1: DecidePAC
parameter: Algorithm A solving the limited decision problem for the class of proofs S.
input    : Formula $\varphi$; $\epsilon, \delta, \gamma \in (0,1)$; list of partial assignments $\rho^{(1)}, \ldots, \rho^{(m)}$ from $M(D)$;
           list of hypothesis formulas H
output   : Accept if there is a proof of $\varphi$ in S from H and formulas $\psi_1, \psi_2, \ldots$ that are
           simultaneously witnessed true with probability at least $1 - \epsilon + \gamma$ on $M(D)$;
           Reject if $H \Rightarrow \varphi$ is not $(1 - \epsilon - \gamma)$-valid under D.
begin
    $B \leftarrow \lfloor \epsilon \cdot m \rfloor$, FAILED $\leftarrow 0$.
    foreach partial assignment $\rho^{(i)}$ in the list do
        if $A(\varphi|_{\rho^{(i)}}, H|_{\rho^{(i)}})$ rejects then
            Increment FAILED.
            if FAILED > B then
                return Reject
    return Accept
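
The control flow of Algorithm 1 (DecidePAC) can be sketched in a few lines of Python. This is an illustrative sketch only: the limited decision procedure `A` and the restriction operator `restrict` are passed in as parameters and are assumed to be supplied by the chosen proof system; the stand-ins used in the demonstration at the bottom are toy placeholders, not real proof search.

```python
import math

def decide_pac(A, restrict, phi, hypotheses, eps, partial_assignments):
    """Return True (Accept) or False (Reject), following Algorithm 1.

    A(query, hyps) -> bool       assumed limited decision procedure
    restrict(formula, rho)       assumed restriction operator
    """
    budget = math.floor(eps * len(partial_assignments))  # B = floor(eps * m)
    failed = 0
    for rho in partial_assignments:
        restricted_hyps = [restrict(h, rho) for h in hypotheses]
        if not A(restrict(phi, rho), restricted_hyps):
            failed += 1
            if failed > budget:
                return False  # too many restricted queries went unproved
    return True

# Toy stand-ins: "restriction" just pairs the formula with the example, and the
# placeholder decision procedure fails exactly on examples labelled "bad".
toy_restrict = lambda f, rho: (f, rho)
toy_A = lambda query, hyps: query[1] != "bad"

print(decide_pac(toy_A, toy_restrict, "phi", ["h"], 0.25, ["ok"] * 6 + ["bad"] * 2))
```

Note that no explicit representation of any learned formula is ever built; the examples are consumed one at a time and only a counter is kept.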

More formally, these "axioms" are formulas for which it is feasible to verify consistency with 
the underlying distribution (from the masked examples), that nevertheless suffice to complete a 
proof. This is necessary in some sense (cf. Proposition 32), and at least seems to be not much more
restrictive than the requirements imposed by concept learning. Specifically, we will utilize formulas
that are witnessed to evaluate to true on the distribution over partial assignments with probability
at least $(1 - \epsilon)$. We will consider any such formulas to be "fair game" for our algorithm, much as
any member of a given concept class is "fair game" for concept learning. 

We now state and prove the main theorem, showing that a variant of the limited decision 
problem in which the proof may invoke these learnable formulas as "axioms" is essentially no 
harder than the original limited decision problem, as long as the proof system is restriction-closed. 
The reduction is very simple and is given in Algorithm 1.

Theorem 13 (Adding implicit learning preserves tractability) Let S be a restriction-closed
set of proofs for a restriction-closed proof system. Suppose that there is an algorithm for the limited
decision problem for S running in time $T(n, |\varphi|, |H|)$ on input $\varphi$ and H over $n$ variables. Let D be
a distribution over assignments, M be any masking process, and H be any set of formulas. Then
there is an algorithm that, on input $\varphi$, H, $\delta$ and $\epsilon$, uses $O(\frac{1}{\gamma^2} \log \frac{1}{\delta})$ examples, runs in time
$O(T(n, |\varphi|, |H|) \cdot \frac{1}{\gamma^2} \log \frac{1}{\delta})$, and such that given that either

• $[H \Rightarrow \varphi]$ is not $(1 - \epsilon - \gamma)$-valid with respect to D or

• there exists a proof of $\varphi$ from $\{\psi_1, \ldots, \psi_k\} \cup H$ in S such that $\psi_1, \ldots, \psi_k$ are all witnessed to
evaluate to true with probability $(1 - \epsilon + \gamma)$ over $M(D)$

decides which case holds.

Proof: Suppose we run Algorithm 1 on $m = \frac{1}{2\gamma^2} \ln \frac{1}{\delta}$ examples drawn from D. Then, (noting
that we need at most $\log m$ bits of precision for B) the claimed running time bound and sample
complexity is immediate.

As for correctness, first note that by the soundness of the proof system, whenever there is
a proof of $\varphi|_{\rho^{(i)}}$ from $H|_{\rho^{(i)}}$, $\varphi|_{\rho^{(i)}}$ must evaluate to true in any interpretation of the remaining
variables consistent with $H|_{\rho^{(i)}}$. Thus, if $H \Rightarrow \varphi$ is not $(1 - \epsilon - \gamma)$-valid with respect to D, an
interpretation sampled from D must satisfy H and falsify $\varphi$ with probability at least $\epsilon + \gamma$; for
any partial assignment $\rho$ derived from this interpretation (i.e., sampled from $M(D)$), the original
interpretation is still consistent, and therefore $H|_\rho \not\models \varphi|_\rho$ for this $\rho$. So in summary, we see that a
$\rho$ sampled from $M(D)$ produces a formula $\varphi|_\rho$ such that $H|_\rho \not\models \varphi|_\rho$ with probability at least $\epsilon + \gamma$,
and so the limited decision algorithm A rejects with probability at least $\epsilon + \gamma$. It follows from
Hoeffding's inequality now that for $m$ as specified above, at least $\epsilon m$ of the runs of A reject (and
hence the algorithm rejects) with probability at least $1 - \delta$.

So, suppose instead that there is a proof in S of $\varphi$ from H and some formulas $\psi_1, \ldots, \psi_k$ that
are all witnessed to evaluate to true with probability at least $(1 - \epsilon + \gamma)$ over $M(D)$. Then, with
probability $(1 - \epsilon + \gamma)$, $\psi_1|_\rho, \ldots, \psi_k|_\rho = 1$. Then, since S is a restriction closed set, if we replace
each assertion of some $\psi_j$ with an invocation of $R_0$ for the axiom 1, then by applying the restriction
$\rho$ to every formula in the proof, one can obtain a proof of $\varphi|_\rho$ from $H|_\rho$ alone. Therefore, as A
solves the limited decision problem for S, we see that for each $\rho$ drawn from $M(D)$, $A(\varphi|_\rho, H|_\rho)$
must accept with probability at least $(1 - \epsilon + \gamma)$, and Hoeffding's inequality again gives that the
probability that more than $\epsilon m$ of the runs reject is at most $\delta$ for this choice of $m$. □
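
The two appeals to Hoeffding's inequality in the proof can be spelled out as follows. This is a sketch under our own notation: $X_i$ is the indicator that A rejects on the $i$-th example and $p = \Pr[X_i = 1]$, with the $X_i$ independent because the examples are drawn independently.

```latex
% Case 1: p \ge \epsilon + \gamma (the query should be rejected).
\Pr\Bigl[\tfrac{1}{m}\sum_{i=1}^{m} X_i \le \epsilon\Bigr]
  \le \Pr\Bigl[\tfrac{1}{m}\sum_{i=1}^{m} X_i - p \le -\gamma\Bigr]
  \le e^{-2 m \gamma^2}.

% Case 2: p \le \epsilon - \gamma (the query should be accepted).
\Pr\Bigl[\tfrac{1}{m}\sum_{i=1}^{m} X_i > \epsilon\Bigr]
  \le \Pr\Bigl[\tfrac{1}{m}\sum_{i=1}^{m} X_i - p \ge \gamma\Bigr]
  \le e^{-2 m \gamma^2}.

% Both bounds are at most \delta as soon as
m \ge \frac{1}{2\gamma^2} \ln \frac{1}{\delta}.
```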

The necessity of computationally feasible witnessing. The reader may, at this point, feel 
that our notion of witnessed values is somewhat ad-hoc, and suspect that perhaps a weaker notion 
should be considered (corresponding to a broader class of masking processes). Although it may 
be the case that a better notion exists, we observe in Appendix A that it is crucial that we
use some kind of evaluation algorithm on partial assignments that is computationally feasible. 
Witnessed evaluation is thus, at least, one such notion, whereas other natural notions are likely 
computationally infeasible, and thus inappropriate for such purposes. 

4 Proof systems with tractable, restriction-closed special cases 

We now show that most of the usual propositional proof systems considered in the literature possess 
natural restriction-closed special cases, for which the limited decision problem may be efficiently 
solved. Thus, in each case, we can invoke Theorem 13 to show that we can efficiently integrate
implicit learning into the reasoning algorithm for the proof system. 

4.1 Special cases of resolution 

Our first example of a proof system for use in reasoning in PAC-Semantics is resolution, a standard 
object of study in proof theory. Largely due to its simplicity, resolution turned out to be an excellent 
system for the design of surprisingly effective proof search algorithms such as DPLL [14, 13].
Resolution thus remains attractive as a proof system possessing natural special cases for which we 
can design relatively efficient algorithms for proof search. We will recall two such examples here. 
The resolution proof system. Resolution is a proof system that operates on clauses, i.e., disjunctions
of literals. The main inference rule in resolution is the cut rule: given two clauses containing a
complementary pair of literals (i.e., one contains the negation of a variable appearing without negation
in the other) $A \lor x$ and $B \lor \neg x$, we infer the resolvent $A \lor B$. We will also find it convenient to
use the weakening rule: from any clause $C$, for any set of literals $\ell_1, \ldots, \ell_k$, we can infer the clause
$C \lor \ell_1 \lor \cdots \lor \ell_k$. As stated, resolution derives new clauses from a set of known clauses (a CNF
formula). Typically, one actually refers to resolution as a proof system for DNF formulas by using
a resolution proof as a proof by contradiction: one shows how the unsatisfiable empty clause $\bot$ can
be derived from the negation of the input DNF. This is referred to as a resolution refutation of the
target DNF, and can also incorporate explicit hypotheses given as CNF formulas.

Treelike resolution proofs. The main syntactic restriction we consider on resolution refutations 
intuitively corresponds to a restriction that a clause has to be derived anew each time we wish to 
use it in a proof — a restriction that the proof may not (re-)use "lemmas." It will not be hard 
to see that while this does not impact the completeness of the system since derivations may be 
repeated, this workaround comes at the cost of increasing the size of the proof. A syntactic way of 
capturing these proofs proceeds by recalling that the proof is given by a sequence of clauses that 
are either derived from earlier clauses in the sequence, or appear in the input CNF formula (to be 
refuted). Consider the following directed acyclic graph (DAG) corresponding to any (resolution) 
proof: the set of nodes of the graph is given by the set of clauses appearing in the lines of the proof, 
and each such node has incoming edges from the nodes corresponding to the clauses earlier in the 
proof used in its derivation; the clauses that appeared in the input CNF formula are therefore the 
sources of this DAG, and the clause proved by the derivation corresponds to a sink of the DAG 
(i.e., in a resolution refutation, the empty clause appears at a sink of the DAG). We say that the 
proof is treelike when this DAG is a (rooted) tree — i.e., each node has at most one outgoing edge 
(equivalently, when there is a unique path from any node to the unique sink). Notice, the edges 
correspond to the use of a clause in a step of the proof, so this syntactic restriction corresponds to 
our intuitive notion described earlier. 
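
As a small illustration of the syntactic condition above, the following sketch checks whether a proof is treelike by counting how often each line is used as a premise. The representation (a dictionary from line identifiers to the lines used in their derivation) and the function name `is_treelike` are assumptions made for this example only; they are not part of the formal development.

```python
from collections import Counter


def is_treelike(premises):
    """Return True when every proof line is used as a premise at most once.

    `premises` maps each line id to the list of earlier line ids used to
    derive it; lines that simply cite an input clause map to an empty list.
    (Hypothetical representation chosen for this sketch.)
    """
    uses = Counter()
    for used in premises.values():
        for line in used:
            uses[line] += 1  # one outgoing edge of `line` in the proof DAG
    return all(count <= 1 for count in uses.values())


# Lines 0 and 1 are input clauses; line 2 is their resolvent.
tree_proof = {0: [], 1: [], 2: [0, 1]}
# Here line 2 is reused as a "lemma" by both line 3 and line 4.
dag_proof = {0: [], 1: [], 2: [0, 1], 3: [2], 4: [2, 3]}
```

In this encoding an input clause that is needed twice must be cited on two separate lines, which matches the intuition that a treelike proof re-derives whatever it reuses.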

We are interested in resolution as a proof system with special cases that not only possess efficient 
decision algorithms, but are furthermore restriction-closed. We will first establish that (treelike) 
resolution in general is restriction-closed, and subsequently consider the effects of our additional 
restrictions on the proofs considered. For syntactic reasons (to satisfy Definition 10), we actually
need to include a tautological formula 1 as an axiom of resolution. We can take this to correspond
to the clause containing all literals, which is always derivable by weakening from any nonempty set 
of clauses (and is furthermore essentially useless in any resolution proof, as it can only be used to 
derive itself). 

Proposition 14 (Treelike resolution is restriction-closed) Resolution is a restriction-closed
proof system. Moreover, the set of treelike resolution proofs of length $L$ is restriction-closed.

Proof: Assuming the inclusion of the tautological axiom 1 as discussed above, the
restriction-closedness is straightforward: Fix a partial assignment $\rho$, and consider any step
of the proof, deriving a clause $C$. If $C$ appeared in the input formula, then $C|_\rho$ appears in
the restriction of the input formula. Otherwise, $C$ is derived by one of our two rules, cut or
weakening. For the cut rule, suppose $C$ is derived from $A \vee x_i$ and $B \vee \neg x_i$. If
$\rho_i \in \{0, 1\}$ then $C|_\rho$ can be derived from either $(A \vee x_i)|_\rho$ or
$(B \vee \neg x_i)|_\rho$ by weakening. If $\rho_i = *$ and $C|_\rho \neq 1$, then both
$(A \vee x_i)|_\rho$ and $(B \vee \neg x_i)|_\rho$ are not 1, and the same literals are eliminated
(set to 0) in these clauses as in $C|_\rho$, so $C|_\rho$ follows from the cut rule applied to $x_i$
on these clauses. If $C|_\rho \neq 1$ followed from weakening of some other clause $C'$, we know
$C'|_\rho \neq 1$ as well, since any satisfied literals in $C'$ appear in $C$; therefore $C|_\rho$
follows from weakening applied to $C'|_\rho$. Finally, if $C|_\rho = 1$, then we already know that
1 can be asserted as an axiom. So, resolution is restriction-closed.

Recalling the DAG corresponding to a resolution proof has nodes corresponding to clauses 
and edges indicating which clauses are used in the derivation of which nodes, note that the DAG 
corresponding to the restriction of a resolution proof as constructed in the previous paragraph has 
no additional edges. Therefore, the sink in the original DAG remains a sink. Although the DAG 
may now be disconnected, if we consider the connected component containing the node corresponding
to the original sink, we see that this is indeed a tree; furthermore, since every clause involved in the 
derivation of a clause corresponding to a node of the tree corresponds to another node of the tree 
and the overall DAG corresponded to a syntactically correct resolution proof from the restriction 
of the input formula, by the restriction-closedness of resolution, this tree corresponds to a treelike 
resolution proof of the restriction of the clause labeling the sink from the restriction of the input 
formula. As this is a subgraph of the original graph, it corresponds to a proof that is also no longer 
than the original, as needed. ■
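
The case analysis in the proof can be made concrete with a short sketch of how a restriction acts on a clause. The encoding is an assumption made for illustration: literals are nonzero integers (a negative integer is a negated variable), a partial assignment is a dictionary from variables to 0 or 1 with unassigned (`*`) variables simply absent, and the sentinel `TRUE` stands for the tautological formula 1.

```python
TRUE = "1"  # sentinel standing for the tautological formula 1 (assumed encoding)


def restrict_clause(clause, rho):
    """Apply the partial assignment `rho` to a clause.

    A satisfied literal makes the whole clause 1; falsified literals are
    dropped; literals over unassigned variables are kept.
    """
    remaining = set()
    for lit in clause:
        var = abs(lit)
        if var not in rho:
            remaining.add(lit)
        elif (rho[var] == 1) == (lit > 0):
            return TRUE  # the literal is satisfied by rho
    return frozenset(remaining)


def restrict_cnf(cnf, rho):
    """Restrict every clause and drop those that simplify to 1."""
    restricted = [restrict_clause(c, rho) for c in cnf]
    return [c for c in restricted if c != TRUE]


# Cut step: from (x1 v x2) and (~x1 v x3) infer (x2 v x3).
a_or_x, b_or_not_x, resolvent = {1, 2}, {-1, 3}, {2, 3}
rho = {1: 1}  # x1 is set to 1; x2 and x3 are unassigned
```

With `rho = {1: 1}` the first premise simplifies to 1, the second becomes the clause containing only x3, and the restricted resolvent is a weakening of it, exactly as in the case where the cut variable is assigned.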

Bounded-space treelike resolution. Our first special case assumes not only that the resolution 
proof is treelike, but also that it can be carried out using limited space, in the sense first explored 
by Esteban and Torán [18]. That is, we associate with each step of the proof a set of clauses
that we refer to as the blackboard. Each time a clause is derived during a step of the proof, we 
consider it to be added to the blackboard; we also allow any clauses in the blackboard to be erased 
across subsequent steps of the proof. Now, the central restriction is that instead of simply requiring 
the steps of the proof to utilize clauses that appeared earlier in the proof, we demand that they 
only utilize clauses that appeared in the blackboard set on the previous step. We now say that the 
proof uses (clause) space s if the blackboard never contains more than s clauses. We note that 
the restriction that the proof is treelike means that each time we utilize clauses in a derivation, we 
are free to delete them from the blackboard. In fact, given the notion of a blackboard, it is easily 
verified that this is an equivalent definition of a treelike proof. Even with the added restriction to 
clause space s, treelike resolution remains restriction-closed: 

Proposition 15 The set of clause space-$s$ treelike resolution proofs is restriction-closed.

Proof: Let a space-$s$ treelike resolution proof $\Pi$ and any partial assignment $\rho$ be given;
we recall the corresponding treelike proof $\Pi'$ constructed in the proof of Proposition 14. We
suppose that $\Pi$ derives the sequence of clauses $\{C_i\}_{i=1}^{|\Pi|}$ (for which $C_i$ is
derived on the $i$th step of $\Pi$) and $\Pi'$ derives the subsequence
$\{C_{i_j}|_\rho\}_{j=1}^{|\Pi'|}$. Given the corresponding sequence of blackboards
$\{B_i\}_{i=1}^{|\Pi|}$ establishing that $\Pi$ can be carried out in clause space $s$, we construct
a sequence of blackboards $B'_i = \{C_j|_\rho : C_j \in B_i, \exists k \text{ s.t. } j = i_k\}$ for
$\Pi'$, and take the subsequence corresponding to steps in $\Pi'$,
$\{B'_{i_j}\}_{j=1}^{|\Pi'|}$.

It is immediate that every $B'_{i_j}$ contains at most $s$ clauses, so we only need to establish
that these are a legal sequence of blackboards for $\Pi'$. We first note that whenever a clause is
added to a blackboard $B'_{i_j}$ over $B'_{i_{j-1}}$, then since (by construction) it was not added
in any intermediate step $i_{j-1} < t < i_j$, it must be that it is added (to $B_{i_j}$) in step
$i_j$, which we know originally derived $C_{i_j}$ in $\Pi$, and hence $\Pi'$ derives
$C_{i_j}|_\rho$ by construction of $\Pi'$ (so this is the corresponding $j$th step of $\Pi'$).
Likewise, if a clause is needed for the derivation of any $j$th step of $\Pi'$, by the construction
of $\Pi'$ from $\Pi$, it must be that $C_{i_j}|_\rho \neq 1$, and whenever some step $i_j$ of $\Pi$
uses an unsatisfied clause from some earlier step $t$ of $\Pi$, then $\Pi'$ includes the step
corresponding to $t$. Therefore there exists $k$ such that $t = i_k$; and, as
$C_{i_k} \in B_{i_j}$, $C_{i_k}|_\rho \in B'_{i_j}$. Thus, $\{B'_{i_j}\}_{j=1}^{|\Pi'|}$ is a legal
sequence of blackboards for $\Pi'$. ■

Algorithm 2: SearchSpace
input : CNF $\varphi$, integer space bound $s \geq 1$, current clause $C$
output: A space-$s$ treelike resolution proof of $C$ from clauses in $\varphi$, or "none" if no
        such proof exists.
begin
    if $C$ is a superset of some clause $C'$ of $\varphi$ then
        return the weakening derivation of $C$ from $C'$
    else if $s > 1$ then
        foreach literal $\ell$ such that neither $\ell$ nor $\neg\ell$ is in $C$ do
            if $\Pi_1 \leftarrow$ SearchSpace($\varphi$, $s - 1$, $C \vee \ell$) does not return none then
                if $\Pi_2 \leftarrow$ SearchSpace($\varphi$, $s$, $C \vee \neg\ell$) does not return none then
                    return derivation of $C$ from $\Pi_1$ and $\Pi_2$
                else
                    return none
    return none
end

The algorithm for finding space-$s$ resolution proofs, SearchSpace, appears as Algorithm 2.
Although the analysis of this algorithm appears elsewhere, we include the proof (and its history) in
Appendix B for completeness.

Theorem 16 (SearchSpace finds space-$s$ treelike proofs when they exist) If there is a space-$s$
treelike proof of a clause $C$ from a CNF formula $\varphi$, then SearchSpace returns such a proof,
and otherwise it returns "none." In either case, it runs in time
$O(|\varphi| \cdot n^{2(s-1)})$, where $n$ is the number of variables.
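
For concreteness, here is a direct Python transcription of the SearchSpace pseudocode (Algorithm 2). It is a sketch under stated assumptions rather than a reference implementation: literals are encoded as nonzero integers, the number of variables `n_vars` is passed explicitly, and the nested-tuple proof format is invented for this example.

```python
def search_space(cnf, s, clause, n_vars):
    """Look for a space-s treelike resolution derivation of `clause` from `cnf`.

    Returns a nested tuple describing the derivation, or None when the
    recursion finds none. Clauses are sets of nonzero integers (assumed
    encoding: -v is the negation of variable v).
    """
    clause = frozenset(clause)
    for c in cnf:
        if frozenset(c) <= clause:
            # `clause` is a weakening of an input clause.
            return ("weaken", frozenset(c), clause)
    if s > 1:
        for var in range(1, n_vars + 1):
            if var in clause or -var in clause:
                continue
            for lit in (var, -var):
                first = search_space(cnf, s - 1, clause | {lit}, n_vars)
                if first is not None:
                    second = search_space(cnf, s, clause | {-lit}, n_vars)
                    if second is not None:
                        return ("cut", lit, clause, first, second)
                    return None  # mirrors the "else return none" branch
    return None


# All four clauses over two variables: an unsatisfiable CNF.
phi = [{1, 2}, {-1, 2}, {1, -2}, {-1, -2}]
refutation = search_space(phi, 3, frozenset(), 2)  # derive the empty clause
```

On this input the search with `s = 3` returns a derivation whose root is a cut, while the same call with `s = 2` returns `None`; turning the procedure into a decision algorithm amounts to testing whether the result is `None`.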

Naturally, we can convert SearchSpace into a decision algorithm by accepting precisely when it
returns a proof. Therefore, as space-$s$ treelike resolution proofs are restriction-closed by
Proposition 15, Theorem 13 can be applied to obtain an algorithm that efficiently learns implicitly
from example partial assignments to solve the corresponding limited decision problem for
$(1 - \epsilon)$-validity with space-$s$ treelike resolution proofs. Explicitly, we obtain:

Corollary 17 (Implicit learning in space-bounded treelike resolution) Let a KB CNF $\varphi$
and clause $C$ be given, and suppose that partial assignments are drawn from a masking process for
an underlying distribution $D$; suppose further that either

1. There exists some CNF $\psi$ such that partial assignments from the masking process are
witnessed to satisfy $\psi$ with probability at least $(1 - \epsilon + \gamma)$ and there is a
space-$s$ treelike proof of $C$ from $\varphi \wedge \psi$ or else

2. $[\varphi \Rightarrow C]$ is at most $(1 - \epsilon - \gamma)$-valid with respect to $D$ for
$\gamma > 0$.

Then, there is an algorithm running in time
$O\left(\frac{|\varphi|}{\gamma^2} n^{2(s-1)} \log \frac{1}{\delta}\right)$ that distinguishes these
cases with probability $1 - \delta$ when given $C$, $\varphi$, $\epsilon$, $\gamma$, and a sample of
$O\left(\frac{1}{\gamma^2} \log \frac{1}{\delta}\right)$ partial assignments.

A quasipolynomial time algorithm for treelike resolution. As we noted previously, Beame
and Pitassi [6] gave an algorithm essentially similar to SearchSpace, but only established that it
could find treelike proofs in quasipolynomial time. Their result follows from Theorem 16 and the
following generic space bound:

Proposition 18 A treelike proof $\Pi$ can be carried out in clause space at most
$\log_2 |\Pi| + 1$.

Therefore, if there is a treelike proof of a clause $C$ from a formula $\varphi$ of size $n^k$,
SearchSpace (run with the bound $s = k \log n + 1$) finds the proof in time
$O(|\varphi| \cdot n^{2k \log n})$. We also include the proof in Appendix B.
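
The space bound can be explored numerically with the following sketch. It assumes the same accounting that SearchSpace uses (a leaf costs one clause, and a step holds the already-derived premises while deriving the next one, most expensive first); the tree encoding (`None` for an input clause, a tuple of subtrees for a derived clause) and the helper names are hypothetical.

```python
import math


def size(tree):
    """Number of lines in a treelike proof; None denotes an input clause."""
    return 1 if tree is None else 1 + sum(size(t) for t in tree)


def space(tree):
    """Clause space needed under the accounting assumed in this sketch."""
    if tree is None:
        return 1
    # Derive the most space-hungry premise first, then the next one while
    # holding the premises already derived on the blackboard.
    needs = sorted((space(t) for t in tree), reverse=True)
    return max(need + held for held, need in enumerate(needs))


def complete(height):
    """Complete binary proof tree of the given height."""
    return None if height == 0 else (complete(height - 1), complete(height - 1))


balanced = complete(3)       # 15 lines
caterpillar = None
for _ in range(5):           # each step resolves with a fresh input clause
    caterpillar = (caterpillar, None)
```

The balanced tree is the worst case for this accounting: it needs space 4 with 15 lines, which respects the bound $\log_2 15 + 1$, whereas the path-like tree never needs more than two clauses at once.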

Bounded-width resolution. Our second special case of resolution considers proofs using small
clauses. Precisely, we refer to the number of literals appearing in a clause as the width of the
clause, and we naturally consider the width of a resolution proof to be the maximum width of any
clause derived in the proof (i.e., excluding the input clauses). Bounded-width resolution was
originally formally investigated by Galil [20], who exhibited an efficient dynamic programming
algorithm for bounded-width resolution. Galil's algorithm easily generalizes to $k$-DNF resolution,
i.e., the proof system RES($k$) (with standard resolution being recovered by $k = 1$), so we will
present the more general case here.

Briefly, RES($k$), introduced by Krajíček [32], is a proof system that generalizes resolution by
operating on $k$-DNF formulas instead of clauses (which are, of course, 1-DNF formulas) and
introduces some new inference rules, described below. In more detail, recall that a $k$-DNF is a
disjunction of conjunctions of literals, where each conjunction contains at most $k$ literals. Each
step of a RES($k$) proof derives a $k$-DNF from one of the following rules. Weakening is essentially
similar to the analogous rule in resolution: from a $k$-DNF $\varphi$, we can infer the $k$-DNF
$\varphi \vee \psi$ for any $k$-DNF $\psi$. RES($k$) also features an essentially similar cut rule:
from a $k$-DNF $A \vee (\ell_1 \wedge \cdots \wedge \ell_j)$ ($j \leq k$) and another $k$-DNF
$B \vee \neg\ell_1 \vee \cdots \vee \neg\ell_j$, we can infer the $k$-DNF $A \vee B$. The new rules
involve manipulating the conjunctions: given $j \leq k$ formulas
$\ell_1 \vee A, \ldots, \ell_j \vee A$, we can infer $(\ell_1 \wedge \cdots \wedge \ell_j) \vee A$
by $\wedge$-introduction. Likewise, given $(\ell_1 \wedge \cdots \wedge \ell_j) \vee A$, we can
infer $\ell_i \vee A$ for any $i = 1, \ldots, j$ by $\wedge$-elimination.

We wish to show that RES($k$) is restriction-closed; actually, for technical simplicity, we will
represent 1 by the disjunction of all literals. This can be derived from any DNF by a linear number
of $\wedge$-elimination steps (in the size of the original DNF) followed by a weakening step, so it
is not increasing the power of RES($k$) appreciably to include such a rule.

Proposition 19 For any k, RES(k) is restriction-closed. 

Proof: We are given (by assumption) that our encoding of 1 is an axiom. Let any partial
assignment $\rho$ be given, and consider the DNF $\varphi$ derived on any step of the proof.
Naturally, if $\varphi$ was a hypothesis, then $\varphi|_\rho$ is also a hypothesis. Otherwise, it
was derived by one of the four inference rules. We suppose that $\varphi|_\rho \neq 1$ (or else we
are done). Thus, if $\varphi$ was derived by weakening from $\psi$, it must be the case that
$\psi|_\rho \neq 1$, since otherwise $\varphi|_\rho = 1$, so $\varphi|_\rho$ follows from
$\psi|_\rho$ again by weakening since every conjunction in $\psi|_\rho$ appears in $\varphi|_\rho$.
Likewise, if $\varphi = \ell_i \vee A$ was derived by $\wedge$-elimination from
$\psi = (\ell_1 \wedge \cdots \wedge \ell_j) \vee A$, then since $\ell_i|_\rho \neq 1$ and
$A|_\rho$ must not be 1, neither the conjunction $\ell_i$ was taken from in $\psi$ nor the rest of
the formula $A$ evaluates to 1, and thus $\psi|_\rho \neq 1$. Then, if some $\ell_t$ is set to 0 by
$\rho$, $\psi|_\rho = A|_\rho$, and $\varphi|_\rho$ follows from $\psi|_\rho$ by weakening;
otherwise, $\varphi|_\rho$ still follows by $\wedge$-elimination.

We now turn to consider $\varphi = (\ell_1 \wedge \cdots \wedge \ell_j) \vee A$ that were derived
by $\wedge$-introduction. We first consider the case where some literal $\ell_i$ in the new
conjunction is set to 0 in $\rho$ (and so $\varphi|_\rho = A|_\rho$). In this case, one of the
premises in the $\wedge$-introduction step was $\ell_i \vee A$, where
$(\ell_i \vee A)|_\rho = A|_\rho = \varphi|_\rho$, so in fact $\varphi|_\rho$ can be derived just as
$\ell_i \vee A$ was derived. We now suppose that no $\ell_i$ is set to 0 in $\rho$; let
$\ell_{i_1}, \ldots, \ell_{i_s}$ denote the subset of those literals that are not set to 1 (i.e.,
satisfy $\ell_{i_t}|_\rho = \ell_{i_t}$). Then
$\varphi|_\rho = (\ell_{i_1} \wedge \cdots \wedge \ell_{i_s}) \vee A|_\rho$, where since
$A|_\rho \neq 1$, the premises $\ell_{i_t} \vee A$ used to derive $\varphi$ all satisfy
$(\ell_{i_t} \vee A)|_\rho = \ell_{i_t} \vee A|_\rho \neq 1$, and so we can again derive
$\varphi|_\rho$ by $\wedge$-introduction from this subset of the original premises.

Finally, we suppose that $\varphi = A \vee B$ was derived by the cut rule applied to
$A \vee (\ell_1 \wedge \cdots \wedge \ell_j)$ and $B \vee \neg\ell_1 \vee \cdots \vee \neg\ell_j$.
If some $\ell_i$ is set to 0 by $\rho$, then the first premise satisfies
$(A \vee (\ell_1 \wedge \cdots \wedge \ell_j))|_\rho = A|_\rho$ and so
$\varphi|_\rho = A|_\rho \vee B|_\rho$ can be derived by weakening from the first premise. If not,
we let $\ell_{i_1}, \ldots, \ell_{i_s}$ denote the subset of those literals that are not set to 1.
Then the first premise becomes
$A|_\rho \vee (\ell_{i_1} \wedge \cdots \wedge \ell_{i_s}) \neq 1$ (since we assumed
$\varphi|_\rho \neq 1$) and likewise, the second premise becomes
$B|_\rho \vee \neg\ell_{i_1} \vee \cdots \vee \neg\ell_{i_s} \neq 1$ (as likewise $B|_\rho \neq 1$
and no $\ell_{i_t}|_\rho = 0$), so $\varphi|_\rho$ follows by the cut rule applied to these two
premises. ■

Now, RES($k$) possesses a "bounded-width" restriction whose limited decision problem can be
solved by a dynamic programming algorithm (given in pseudocode as Algorithm 3). More precisely, we
will say that a DNF has width $w$ if it is a disjunction of at most $w$ conjunctions, and so
likewise the width of a RES($k$) proof is the maximum width of any $k$-DNF derived in the proof.

Theorem 20 (Efficient decision of bounded-width RES($k$)) Algorithm 3 accepts iff there is
a RES($k$) proof of its input $\varphi$ from the input $k$-DNF formulas
$\varphi_1, \ldots, \varphi_\ell$ of width at most $w$. If there are $n$ variables, it runs in time
$O(n^{kw+1} (n^{kw} + \ell)^k \max\{k n^{kw}, |\varphi_i|\})$.

Proof: The correctness is straightforward: if there is a width-$w$ RES($k$) proof, then a new
derivation step from the proof is performed on each iteration of the main loop until $\varphi$ is
derived, and conversely, every time $T[\psi]$ is set to 1, a width-$w$ derivation of $\psi$ could be
extracted from the run of the algorithm. So, it only remains to consider the running time.

The main observation is that there are at most $O(n^{kw})$ width-$w$ $k$-DNFs. (The initialization
thus takes time at most $O(n^{kw} \ell)$.) At least one of these must be derived on each iteration.
Each iteration considers all possible derivations using up to $k$ distinct formulas either in the
table or given in the input, of which there are $O((n^{kw} + \ell)^k)$ tuples. We thus need to
consider only the time to check each of the possible derivations.

A formula $\psi_1$ must be a width-$w$ $k$-DNF for another width-$w$ $k$-DNF $\psi'$ to be
derivable via weakening, and then for each other width-$w$ $k$-DNF $\psi'$, we can check whether or
not it is a weakening of $\psi_1$ in time $O(n^{kw})$ by just checking whether all of the
conjunctions of $\psi_1$ appear in $\psi'$. Likewise, for $\wedge$-introduction, the formula must
already be a width-$w$ $k$-DNF, and we can check whether or not the $j \leq k$ formulas have a
shared common part by first checking which conjunctions from the first formula appear in the
second, and then, if only one literal is left over in each, checking that the other $j - 2$
formulas have the same common parts with one literal left over. We then obtain the resulting
derivation by collecting these $j$ literals, in an overall time of $O(k n^{kw})$.

For the $\wedge$-elimination rule, the formula must already be width-$w$ for us to obtain a
width-$w$ result. Then, we can easily generate each of the possible results in time linear in the
length of the formula, that is, $O(n^{kw})$. For the cut rule, we only need to examine each
conjunction of each formula, and check if the literals appear negated among the conjunctions of the
other formula, taking time linear in the size of the formulas, which is
$O(\max\{n^{kw}, |\varphi_i|\})$; checking that the result is a width-$w$ $k$-DNF then likewise can
be done in linear time in the size of the formulas. ■
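
To give a feel for the table-filling idea in the simplest setting, the following sketch handles only the case $k = 1$, i.e., ordinary resolution where the width of a clause is its number of literals. It is not a transcription of Algorithm 3: it saturates the set of clauses derivable by cuts whose resolvents have width at most $w$, ignores weakening and tautological resolvents, and uses an assumed integer encoding of literals.

```python
def bounded_width_refutable(cnf, w):
    """Decide whether the empty clause is derivable by cuts of width <= w.

    Input clauses may be wider than w; only derived clauses are bounded.
    Simplified sketch for k = 1 (assumed encoding: -v negates variable v).
    """
    derived = {frozenset(c) for c in cnf}
    changed = True
    while changed:  # repeat until no new clause can be added to the table
        changed = False
        for a in list(derived):
            for b in list(derived):
                for lit in a:
                    if -lit not in b:
                        continue
                    resolvent = (a - {lit}) | (b - {-lit})
                    if len(resolvent) > w:
                        continue  # too wide to be stored in the table
                    if any(-l in resolvent for l in resolvent):
                        continue  # tautological resolvent, never needed
                    if resolvent not in derived:
                        derived.add(resolvent)
                        changed = True
    return frozenset() in derived


# Refuting this CNF requires the width-2 intermediate clause (x2 v x3).
needs_width_two = [{1, 2, 3}, {-1}, {-2}, {-3}]
```

Since at most polynomially many clauses of width at most $w$ exist for fixed $w$, the loop adds only polynomially many table entries, which is the same counting argument used in the running time analysis above.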

Finally, we note that the width-$w$ syntactic restriction of RES($k$) refutations is
restriction-closed:

Proposition 21 The set of width-$w$ RES($k$) refutations is restriction-closed.

Proof: Let any width-$w$ RES($k$) refutation $\Pi$ and partial assignment $\rho$ be given. In the
construction used in Proposition 19, we obtained a proof $\Pi'$ of $\bot|_\rho = \bot$ from $\Pi$
with the property that every formula $\psi'$ appearing in $\Pi'$ satisfies $\psi' = \psi|_\rho$ for
some $\psi$ appearing in $\Pi$. Furthermore, we guaranteed that no derivation step used a formula
that simplified to 1. It therefore suffices to note that for any width-$w$ $k$-DNF $\psi$,
$\psi|_\rho$ is also a $k$-DNF with width at most $w$. ■

By Theorem 13, DecidePAC can be applied to Algorithm 3 to obtain a second implicit learning
algorithm, for width-$w$ RES($k$).

Corollary 22 (Implicit learning in bounded-width RES($k$)) Let a KB of $k$-DNFs
$\varphi_1, \ldots, \varphi_\ell$ and target disjunction of $k$-CNFs $\varphi$ be given, and suppose
that partial assignments are drawn from a masking process for an underlying distribution $D$;
suppose further that either

1. There exists some conjunction of $k$-DNFs $\psi$ such that partial assignments from the masking
process are witnessed to satisfy $\psi$ with probability at least $(1 - \epsilon + \gamma)$ and
there is a width-$w$ RES($k$) refutation of
$\neg\varphi \wedge \varphi_1 \wedge \cdots \wedge \varphi_\ell \wedge \psi$ or else

2. $[\varphi_1 \wedge \cdots \wedge \varphi_\ell \Rightarrow \varphi]$ is at most
$(1 - \epsilon - \gamma)$-valid with respect to $D$ for $\gamma > 0$.

Then, there is an algorithm running in time
$O\left(n^{kw+1} (n^{kw} + \ell)^k \max\{k n^{kw}, |\varphi_i|\} \frac{1}{\gamma^2} \log \frac{1}{\delta}\right)$
that distinguishes these cases with probability $1 - \delta$ when given $\varphi$,
$\varphi_1, \ldots, \varphi_\ell$, $\epsilon$, $\gamma$, and a sample of
$O\left(\frac{1}{\gamma^2} \log \frac{1}{\delta}\right)$ partial assignments.

4.2 Degree-bounded polynomial calculus 

Our next example proof system is polynomial calculus, an algebraic proof system originally
introduced by Clegg et al. [11] as a (first) example of a proof system that, on the one hand, could
simulate resolution (the gold standard for theorem-proving heuristics) and, on the other hand,
possessed a natural special case for which the limited decision problem could demonstrably be
solved in polynomial time using a now standard computer algebra algorithm, the Gröbner basis
algorithm due to Buchberger [8]. The original hopes of Clegg et al., that polynomial calculus might
one day supplant resolution as the proof system of choice, have not been fulfilled, because
heuristics based on resolution have been observed to perform spectacularly well in practice.
Polynomial calculus nevertheless represents a potentially more powerful system that furthermore
illustrates the diversity possible among proof systems.

The polynomial calculus proof system. In polynomial calculus, formulas have the form of
polynomial equations over an arbitrary nontrivial field $\mathbb{F}$ (for the present purposes,
assume $\mathbb{F}$ is $\mathbb{Q}$, the field of rationals), and we are interested in their Boolean
solutions. A set of hypotheses is thus a system of equations, and polynomial calculus enables us to
derive new constraints that are satisfied by any Boolean solutions to the original system. Of
course, in this correspondence, our Boolean variables serve as the variables of the polynomials.

More formally, for our Boolean variables $x_1, \ldots, x_n$, our formulas are equations of the form
$[p = 0]$ for $p \in \mathbb{F}[x_1, \ldots, x_n]$ (i.e., formal multivariate polynomials over the
field $\mathbb{F}$ with indeterminates given by the variables). We require that the polynomials are
represented as a sum of monomials: that is, every line is of the form

$$\sum_{s \in \mathbb{N}^n} c_s \prod_{i \in \mathrm{supp}(s)} x_i^{s_i} = 0$$

for coefficients $c_s \in \mathbb{F}$, where the products
$\prod_{i \in \mathrm{supp}(s)} x_i^{s_i}$ are the monomials corresponding to the degree vector
$s$. For each variable, the proof system has a Boolean axiom $[x^2 - x = 0]$ (asserting that
$x \in \{0, 1\}$). The rules of inference are linear combination, which asserts that for equations
$[p = 0]$ and $[q = 0]$, for any coefficients $a$ and $b$ from $\mathbb{F}$, we can infer
$[a \cdot p + b \cdot q = 0]$; and multiplication, which asserts that for any variable
(indeterminate) $x$ and polynomial equation $[p = 0]$, we can derive $[x \cdot p = 0]$. A
refutation in polynomial calculus is a derivation of the polynomial 1, i.e., the contradictory
equation $[1 = 0]$. We will encode "true" as the equation $[0 = 0]$, and we will modify the system
to allow this equation to be asserted as an axiom; of course, it can be derived in a single step
from any polynomial calculus formula $[p = 0]$ by the linear combination $p + (-1)p$, so we are
essentially not changing the power of the proof system at all.

We also note that without loss of generality, we can restrict our attention to formulas in which 
no indeterminate appears in a monomial with degree greater than one — such monomials are referred 
to as multilinear. Intuitively this is so because the Boolean axioms assert that a larger power can 
be replaced by a smaller one; formally, one could derive this as follows: Suppose we have a formula
with a monomial expression $x^k \cdot m$. Then by multiplying the Boolean axiom by $x$ a total of
$k - 2$ times, and then by the indeterminates in $m$, one obtains
$[x^k \cdot m - x^{k-1} \cdot m = 0]$. A linear combination with the original formula then yields an
expression with the original monomial replaced by $x^{k-1} \cdot m$, so by repeating this trick
$k - 2$ additional times, we eventually reduce the monomial to $x \cdot m$. The
same trick can be applied to the rest of the indeterminates appearing in m, and then to the rest 
of the monomials in the formula. We will refer to this as the multilinearization of the formula. 
(The original formula could be re-derived by a similar series of steps, so nothing is lost in this 
translation.) Looking ahead, we will be focusing on the degree-bounded restriction of polynomial 
calculus, and so we will assume for simplicity that all formulas are expressed in this multilinearized 
(minimal-degree) form. Of course, because the translation can be performed in a number of steps 
that is quadratic in the total degree and linear in the size of the formula, this does not alter the 
power of the proof system by much at all. 
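
The net effect of multilinearization (as opposed to the step-by-step derivation inside the proof system) can be sketched in a few lines. The representation is an assumption made for this example: a polynomial is a dictionary from monomials, given as tuples of (variable, exponent) pairs, to coefficients, and the multilinear result is keyed by the set of variables in each monomial's support.

```python
from collections import defaultdict
from fractions import Fraction


def multilinearize(poly):
    """Replace every power x^e (e >= 1) by x and merge equal monomials.

    `poly` maps tuples of (variable, exponent) pairs to coefficients
    (hypothetical encoding). The result maps frozensets of variables to
    coefficients, with zero terms dropped.
    """
    out = defaultdict(Fraction)
    for monomial, coeff in poly.items():
        support = frozenset(var for var, exp in monomial if exp > 0)
        out[support] += coeff  # monomials with equal support collapse
    return {m: c for m, c in out.items() if c != 0}


# p = x^3 * y - x * y + 2 * x^2; over Boolean values this equals 2x.
p = {
    (("x", 3), ("y", 1)): 1,
    (("x", 1), ("y", 1)): -1,
    (("x", 2),): 2,
}
reduced = multilinearize(p)
```

Here the two monomials with support {x, y} cancel, leaving only the term 2x, so the multilinear form can be strictly smaller than the original.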

A note on witnessing and restrictions. The polynomial equations can be fit into our frame- 
work of restrictions and witnessing somewhat naturally, thanks to our restriction to the sum of 
monomials representation: since we have restricted our attention to cases where each variable 
(hence, indeterminate in the polynomial) takes only Boolean values, we observe that a monomial 
corresponds (precisely) to a conjunction over the set of variables in the support of its degree vector. 
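Then, if say $\mathbb{F}$ is $\mathbb{Q}$, we can then express the polynomial equation

$$\sum_{s \in \mathbb{N}^n} c_s \prod_{i \in \mathrm{supp}(s)} x_i^{s_i} = 0$$

in the threshold basis, as a conjunction of two thresholds:

$$\left[ \sum_{S \subseteq \{x_1, \ldots, x_n\}, S \neq \emptyset} c_S \bigwedge_{i \in S} x_i \geq -c_\emptyset \right] \wedge \left[ \sum_{S \subseteq \{x_1, \ldots, x_n\}, S \neq \emptyset} -c_S \bigwedge_{i \in S} x_i \geq c_\emptyset \right]$$

for $c_S = \sum_{s \in \mathbb{N}^n : \mathrm{supp}(s) = S} c_s$. The reader may verify that the
effect of a restriction $\rho$ is now

$$\left. \left( \sum_{s \in \mathbb{N}^n} c_s \prod_{i \in \mathrm{supp}(s)} x_i^{s_i} \right) \right|_\rho = \sum_{s \in \mathbb{N}^n : \rho_i = 0 \Rightarrow s_i = 0} c_s \prod_{i \in \mathrm{supp}(s) : \rho_i \neq 1} x_i^{s_i}$$

where we thus denote the polynomial arising from applying $\rho$ to $[p = 0]$ by $p|_\rho$.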

This has the effect that the polynomial equation is witnessed true if all of the monomials (with 
nonzero coefficients) are witnessed, and the equation evaluates to 0, and witnessed false if enough 
of the monomials are witnessed so that regardless of the settings of the rest of the variables, the 
sum is either too large or too small to be zero. Once again, this is a weak kind of "witnessed 
evaluation" that is nevertheless feasible, and saves us from trying to solve a system of multivariate 
polynomial equations — which is easily seen to be NP-hard (NP-complete if we know we are only 
interested in Boolean solutions). 
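
The following sketch illustrates one reading of the restriction and of the weak "witnessed evaluation" just described, for multilinear polynomials. Everything about the encoding is an assumption for illustration: a polynomial maps frozensets of variables to nonzero coefficients, a partial assignment maps variables to 0 or 1 (unassigned variables are absent), and each unwitnessed monomial is treated independently as possibly 0 or 1.

```python
def restrict_poly(poly, rho):
    """Drop monomials containing a variable set to 0; delete variables set to 1."""
    out = {}
    for monomial, coeff in poly.items():
        if any(rho.get(v) == 0 for v in monomial):
            continue
        rest = frozenset(v for v in monomial if v not in rho)
        out[rest] = out.get(rest, 0) + coeff
    return {m: c for m, c in out.items() if c != 0}


def witnessed(poly, rho):
    """Return True/False if [poly = 0] is witnessed true/false, else None."""
    known, lo, hi, unknown = 0, 0, 0, False
    for monomial, coeff in poly.items():
        if any(rho.get(v) == 0 for v in monomial):
            continue                  # monomial witnessed to be 0
        if all(v in rho for v in monomial):
            known += coeff            # monomial witnessed to be 1
        else:
            unknown = True            # could still evaluate to 0 or 1
            lo += min(coeff, 0)
            hi += max(coeff, 0)
    if not unknown:
        return known == 0
    if known + lo > 0 or known + hi < 0:
        return False                  # the sum cannot reach zero
    return None


# p = x*y - z and q = x + y + 1 as multilinear polynomials.
p = {frozenset({"x", "y"}): 1, frozenset({"z"}): -1}
q = {frozenset({"x"}): 1, frozenset({"y"}): 1, frozenset(): 1}
```

For example, `q` is witnessed false even under the empty partial assignment, because its value is at least 1 whatever the variables are, while `p` with only `z` set to 0 is neither witnessed true nor witnessed false.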



Polynomial calculus with resolution. Although polynomial calculus can encode the literal
$\neg x$ as the polynomial $(1 - x)$, the effect of this choice on the encoding of a clause is
undesirable: for example, recalling the correspondence between monomials and conjunctions, the
clause $x_1 \vee \cdots \vee x_n$ corresponds to the polynomial $(1 - x_1) \cdots (1 - x_n)$, which
has an exponential-size (in $n$) monomial representation, and hence requires an exponential-size
polynomial calculus formula. In the interest of efficiently simulating resolution in polynomial
calculus, Alekhnovich et al. [1] introduced the following extension of polynomial calculus known as
polynomial calculus with resolution (PCR): the formulas are extended by introducing for each
variable $x$, a new indeterminate $\bar{x}$, related by the complementarity axiom
$[x + \bar{x} - 1 = 0]$ (forcing $\bar{x} = \neg x$). We can thus represent any clause
$\ell_1 \vee \cdots \vee \ell_k$ as a polynomial calculus formula using a single monomial
$[(\neg\ell_1) \cdots (\neg\ell_k) = 0]$ by choosing the appropriate indeterminate for each
$\neg\ell_i$. The reader may verify that in such a case, the cut rule is captured by adding the
monomials (with coefficients of 1) and weakening may be simulated by (repeated) multiplication.

For the purposes of (partial) evaluation in PCR, our intended semantics for the $\bar{x}$ formulas
is as follows: a partial assignment $\rho$ assigns $\rho(\bar{x}) = *$ whenever $\rho(x) = *$, and
otherwise $\rho(\bar{x}) = \neg\rho(x)$.

Proposition 23 Polynomial calculus and polynomial calculus with resolution are restriction-closed. 

Proof: Let any partial assignment $\rho$ be given. If a proof step asserts a hypothesis $[p = 0]$,
then its restriction $[p|_\rho = 0]$ can also be asserted from the restriction of the hypothesis
set. The Boolean axiom $[x^2 - x = 0]$ can easily be seen to simplify to $[0 = 0]$ if $\rho$
assigns a value to $x$, and otherwise $[x^2 - x = 0]|_\rho = [x^2 - x = 0]$, so in the latter case
we can simply assert the Boolean axiom for $x$. For polynomial calculus with resolution, we need to
further consider the complementarity axioms, but as $\bar{x}$ is witnessed precisely when $x$ is
witnessed, we again have that if $\rho(x) \neq *$, then the complementarity axiom simplifies to
$[0 = 0]$, and otherwise $[x + \bar{x} - 1 = 0]|_\rho = [x + \bar{x} - 1 = 0]$, so we can simply
assert the corresponding complementarity axiom.

Given our inclusion of $[0 = 0]$ as an axiom, it only remains to show that the rules of inference
are preserved under partial evaluations. If $\varphi$ is derived by a linear combination of
$[p = 0]$ and $[q = 0]$ (say $\varphi$ is $[ap + bq = 0]$), then given our encoding of 1 as the
formula $[0 = 0]$, in any case, $(ap + bq)|_\rho = a(p|_\rho) + b(q|_\rho)$, so $\varphi|_\rho$
follows by the same linear combination from $[p = 0]|_\rho$ and $[q = 0]|_\rho$. If $\varphi$ is
derived by multiplication by $x$ from $[p = 0]$, if $\rho(x) = 0$, then $\varphi|_\rho = [0 = 0]$,
which is an axiom. Two cases remain: either $\rho(x) = 1$, in which case
$\varphi|_\rho = [p = 0]|_\rho$ and so $\varphi|_\rho$ follows trivially; or, $\rho(x) = *$ and so
$\varphi|_\rho = [x \cdot (p|_\rho) = 0]$, so $\varphi|_\rho$ follows from $[p = 0]|_\rho$ by
multiplication by $x$. ■

4.2.1 Degree-bounded polynomial calculus 

Given that the monomial representation of polynomials (in contrast to the clauses we considered 
in resolution) may be of exponential size in n (the number of variables), it is natural to wish to 
consider a restricted class of formulas in which the representations of formulas are guaranteed to 
be of polynomial size. One way to achieve this is to consider only degree-$d$ polynomials for some
fixed constant $d$; then there are only $\sum_{i=0}^{d} \binom{n}{i} = O(n^d)$ (multilinear)
monomials, and so (as long as the coefficients are reasonably small) we have a polynomial-size
representation. We assume that an ordering of the monomials has been fixed (e.g., in the
representation) such that monomials with larger degree are considered "larger" in the ordering. We
refer to the first monomial in this ordering with a nonzero coefficient as the leading monomial in
a polynomial. We will refer to the degree of a polynomial calculus or PCR proof as the maximum
degree of any polynomial appearing in a formula used in the proof. We observe that width-$w$
resolution can be simulated by degree-$w$ PCR proofs; thus, in a sense, degree-bounded polynomial
calculus is a natural generalization of width-$w$ resolution.

Degree-bounded polynomial calculus in particular was also first studied by Clegg et al. [11].
The central observation is that the polynomials derivable in bounded degree polynomial calculus
form a vector space; the decision algorithm (given as Algorithm 4) will then simply construct a
basis for this space and use the basis to check if the query lies within the space.

Theorem 24 (Analysis of decision algorithm for degree-$d$ PC/PCR - Theorem 3, [11])
Algorithm 4 solves the limited decision problem for degree-$d$ polynomial calculus (resp. PCR). It
runs in time $O((n^d + \ell) n^{2d})$ where $n$ is the number of indeterminates (variables for
polynomial calculus, literals for PCR).

As the proof appears in the work of Clegg et al. [11], we refer the reader there for details. Clegg
et al. [11] also give another algorithm based on the Gröbner basis algorithm that does not compute
an entire basis. Although their analysis gives a worse worst-case running time for this alternative
algorithm, they believe that it may be more practical; the interested reader should consult the
original paper for details.

In any case, we now return to pursuing our main objective, using Algorithm 4 to obtain algorithms
for implicit learning from examples in polynomial calculus and PCR. We first need to know
that the degree-$d$ restrictions of these proof systems are restriction-closed, which turns out to
be easily established:

Proposition 25 For both polynomial calculus and PCR, the sets of proofs of degree $d$ are
restriction-closed.

Proof: We noted in Proposition 23 that the restriction of any polynomial calculus (resp. PCR)
proof is a valid polynomial calculus (resp. PCR) proof. Let any partial assignment $\rho$ be given;
recalling the connection between monomials and conjunctions, we note that for any monomial
$x_{i_1} \cdots x_{i_k}$ with $k \leq d$ appearing in a formula in a degree-$d$ polynomial calculus
or PCR proof, the restriction under $\rho$ is 0 (of degree 0) if any $x_{i_j}$ is set to 0 by
$\rho$, and otherwise it is $\prod_{j : \rho(x_{i_j}) = *} x_{i_j}$, which has degree at most
$k \leq d$. Thus, the degrees can only decrease, so the restriction of the proof under $\rho$ is
also a degree-$d$ proof. ■

We therefore obtain the following corollary from Theorem [13]:

Corollary 26 (Implicit learning in degree-bounded polynomial calculus and PCR) Let a
list of degree-$d$ polynomials $p_1, \ldots, p_\ell$ and $q$ be given, and suppose that partial assignments are drawn
from a masking process for an underlying distribution $D$; suppose further that either

1. There exists some list of polynomials $h_1, \ldots, h_k$ such that partial assignments from the masking
process are witnessed to satisfy $[h_1 = 0], \ldots, [h_k = 0]$ with probability at least $(1 - \epsilon + \gamma)$
and there is a degree-$d$ polynomial calculus (resp. PCR) derivation of $[q = 0]$ from
$[p_1 = 0], \ldots, [p_\ell = 0], [h_1 = 0], \ldots, [h_k = 0]$ or else

2. $[(p_1 = 0) \wedge \cdots \wedge (p_\ell = 0) \Rightarrow (q = 0)]$ is at most $(1 - \epsilon - \gamma)$-valid with respect to $D$ for $\gamma > 0$.

Then, there is an algorithm running in time $O(\frac{n^d + \ell}{\gamma^2} n^{2d} \log\frac{1}{\delta})$ (given unit cost field operations) that
distinguishes these cases with probability $1 - \delta$ when given $q$, $p_1, \ldots, p_\ell$, $\epsilon$, $\gamma$, and a sample of
$O(\frac{1}{\gamma^2} \log\frac{1}{\delta})$ partial assignments.

4.3 Sparse, bounded cutting planes 

In integer linear programming, one is interested in determining integer solutions to a system of linear
inequalities; cutting planes [23] were introduced as a technique to improve the formulation of an
integer linear program by deriving new inequalities that are satisfied by the integer solutions to the
system of inequalities, but not by all of the fractional solutions. The current formulation of cutting
planes is due to Chvátal [10], and it was explicitly cast as a propositional proof system by Cook et
al. [12], where the objective is to prove that a system has no feasible integer solutions. Much like
resolution, cutting planes are not only simple and natural but, surprisingly, also complete [10,
12]. Furthermore, Cook et al. [12] noted that cutting planes could easily simulate resolution, and
that some formulas that were hard for resolution (encoding the "pigeonhole principle") had simple
cutting plane proofs.

We can also give a syntactic analogue of bounded-width in resolution for cutting planes which
will enable us to state a limited decision problem with an efficient algorithm. Although this restric- 
tion of cutting planes will not be able to express the hard examples for resolution, their simplicity 
and connections to optimization make them a potentially appealing direction for future work. 

The cutting planes proof system. The formulas of cutting planes are inequalities of the form
$[\sum_{i=1}^k c_i x_i \ge b]$ where each $x_i$ is a variable and $c_1, \ldots, c_k$ and $b$ are integers. Naturally, we will
restrict our attention to $\{0,1\}$-integer linear programs (i.e., Boolean-valued), so our system will
feature axioms of the form $x \ge 0$ and $-x \ge -1$ (i.e., $x \le 1$) for each variable $x$. Naturally, we
will allow the addition of two linear inequalities: given $\varphi^{(1)} = [\sum_{i=1}^k c_i^{(1)} x_i \ge b^{(1)}]$ and
$\varphi^{(2)} = [\sum_{i=1}^k c_i^{(2)} x_i \ge b^{(2)}]$, we can derive
$\varphi^{(1)} + \varphi^{(2)} = [\sum_{i=1}^k (c_i^{(1)} + c_i^{(2)}) x_i \ge b^{(1)} + b^{(2)}]$. We will also allow
ourselves to multiply an inequality $[\sum_{i=1}^k c_i x_i \ge b]$ by any positive integer $d$ to obtain
$[\sum_{i=1}^k (d \cdot c_i) x_i \ge d \cdot b]$. Finally, the key rule is division: given an inequality of the form
$[\sum_{i=1}^k (d \cdot c_i) x_i \ge b]$ for a positive integer $d$ (i.e., a common divisor of the coefficients) we can
derive $[\sum_{i=1}^k c_i x_i \ge \lceil b/d \rceil]$;
crucially, this derivation is only sound due to the fact that the $x_i$ are assumed to take integer
values. It is the fact that this rounding may "cut" into the region defined by the system of linear
inequalities that gives the proof system its name. A refutation in cutting planes is a derivation of
the (contradictory) inequality $[0 \ge 1]$.
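The three inference rules can be made concrete with a short sketch. The representation is an assumption made here for illustration only: an inequality $[\sum_i c_i x_i \ge b]$ is a pair `(coeffs, b)` where `coeffs` maps variable names to nonzero integers, and the function names `add`, `multiply`, and `divide` are hypothetical.

```python
def add(phi, psi):
    """Addition rule: sum the coefficients and the thresholds."""
    (c1, b1), (c2, b2) = phi, psi
    total = {x: c1.get(x, 0) + c2.get(x, 0) for x in set(c1) | set(c2)}
    return ({x: c for x, c in total.items() if c != 0}, b1 + b2)

def multiply(phi, d):
    """Multiplication rule: scale by a positive integer d."""
    coeffs, b = phi
    assert d > 0
    return ({x: d * c for x, c in coeffs.items()}, d * b)

def divide(phi, d):
    """Division rule: d must divide every coefficient; the threshold is rounded up."""
    coeffs, b = phi
    assert d > 0 and all(c % d == 0 for c in coeffs.values())
    return ({x: c // d for x, c in coeffs.items()}, -((-b) // d))  # ceil(b / d)

# From x + y >= 1 and x - y >= 0, addition gives 2x >= 1 and division by 2 gives x >= 1.
two_x = add(({'x': 1, 'y': 1}, 1), ({'x': 1, 'y': -1}, 0))
x_ge_1 = divide(two_x, 2)
```

The rounding in `divide` is where the "cut" happens: the fractional point $x = 1/2$ satisfies $2x \ge 1$ but not the derived $x \ge 1$.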

Again, we will need to make some technical modifications that do not change the power of the
proof system by much. We will encode 1 as an axiom by the inequality $[0 \ge -1]$ which, we note,
can be trivially derived in two steps by the standard formulation of cutting planes. We will also
introduce a weakening rule: consider any linear inequality $[\sum_{i=1}^k c_i^{(1)} x_i \ge b^{(1)}]$ that is witnessed
true in every partial assignment, specifically in the one that masks all variables; this means that
$\sum_{i=1}^k \min\{0, c_i^{(1)}\} \ge b^{(1)}$. Then, from any linear inequality $[\sum_{i=1}^k c_i^{(2)} x_i \ge b^{(2)}]$, we will allow ourselves
to derive $[\sum_{i=1}^k (c_i^{(1)} + c_i^{(2)}) x_i \ge b^{(1)} + b^{(2)}]$ in a single step. Of course, $[\sum_{i=1}^k c_i^{(1)} x_i \ge b^{(1)}]$ could
be derived from the axioms in at most $3n + 2$ steps if there are $n$ variables while using only two
formulas' worth of space, whereupon the final inequality follows by addition.

We will also find the following observation convenient: as restrictions are a kind of partial 
evaluation, it is intuitively clear that we can perform the evaluation in stages and obtain the same 
end result, that is: 

Proposition 27 (Restrictions may be broken into stages) Let $\rho$ be a partial assignment, and
let $\sigma$ be another partial assignment such that for every variable $x_i$, whenever $\rho_i = *$, $\sigma_i = *$, and
whenever $\sigma_i \in \{0,1\}$, $\sigma_i = \rho_i$. Now, let $\tau$ be a partial assignment to the variables $\{x_i : \sigma_i = *\}$
such that for every such $x_i$, $\tau_i = \rho_i$. Then for every formula $\varphi$, $\varphi|_\rho = (\varphi|_\sigma)|_\tau$.

Proof: We can verify this by induction on the construction of $\varphi$:

• Naturally, for variables $x_i$, either $\rho_i = *$, in which case $x_i|_\rho = x_i = (x_i|_\sigma)|_\tau$, or else $\rho_i \in \{0,1\}$,
in which case either $\sigma_i = \rho_i$, or else $x_i|_\sigma = x_i$, and then $\tau_i = \rho_i$.

• If $\varphi = \neg\psi$, we have by the induction hypothesis that $\psi|_\rho = (\psi|_\sigma)|_\tau$. Regardless of whether or
not $\varphi$ is witnessed, $\varphi|_\rho = \neg(\psi|_\rho) = \neg((\psi|_\sigma)|_\tau) = (\varphi|_\sigma)|_\tau$.

• If $\varphi = [\sum_{i=1}^k c_i \psi_i \ge b]$, we again have by the induction hypothesis that for every $\psi_i$,
$\psi_i|_\rho = (\psi_i|_\sigma)|_\tau$, and thus, the same $\psi_i$ are witnessed (to evaluate to true or false) in both cases.

  - If $\varphi$ is not witnessed in $\rho$, it is then immediate that $\varphi|_\rho = (\varphi|_\sigma)|_\tau$.

  - If $\varphi$ is witnessed in $\rho$, but not witnessed in $\sigma$, we observe that $\varphi|_\sigma$ must be witnessed in $\tau$
    since the same set of formulas are witnessed to evaluate to true and false in both cases,
    and therefore also again, $\varphi|_\rho = (\varphi|_\sigma)|_\tau$.

  - Finally, when $\varphi$ is witnessed in $\sigma$, we note that by the construction of witnessed values,
    it does not matter what values the formulas witnessed by $\rho$ but not $\sigma$ take; $\varphi$ must be
    witnessed to take the same value under both $\rho$ and $\sigma$. Then since $(\varphi|_\sigma)|_\tau = \varphi|_\sigma \in \{0,1\}$,
    we see once again $(\varphi|_\sigma)|_\tau = \varphi|_\rho$.
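For linear inequalities over variables, the staging property can be checked mechanically. The sketch below rests on assumptions that are ours rather than the source's: an inequality is a pair `(coeffs, b)`, a partial assignment is a dict that simply omits masked variables, and an inequality is taken to be witnessed true (resp. false) when even the smallest (resp. largest) possible value of the remaining sum meets (resp. misses) the threshold. The name `restrict` is hypothetical.

```python
def restrict(phi, rho):
    """Restrict phi = (coeffs, b), or a constant True/False, by the partial assignment rho.

    rho maps variable names to 0 or 1; masked variables are absent from rho.
    """
    if isinstance(phi, bool):
        return phi                                  # constants are unaffected
    coeffs, b = phi
    b = b - sum(c for x, c in coeffs.items() if rho.get(x) == 1)
    rest = {x: c for x, c in coeffs.items() if x not in rho}
    if sum(min(0, c) for c in rest.values()) >= b:
        return True                                 # witnessed true
    if sum(max(0, c) for c in rest.values()) < b:
        return False                                # witnessed false
    return (rest, b)

# 2x - y + z >= 1, restricted by rho = {x: 0, y: 1} at once or in two stages.
phi = ({'x': 2, 'y': -1, 'z': 1}, 1)
rho = {'x': 0, 'y': 1}
sigma, tau = {'y': 1}, {'x': 0}
one_shot = restrict(phi, rho)
staged = restrict(restrict(phi, sigma), tau)
```

Here `sigma` reveals a subset of what `rho` reveals and `tau` supplies the rest, matching the roles of $\sigma$ and $\tau$ in the proposition; `one_shot` and `staged` coincide.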

Proposition 28 Cutting planes is restriction-closed.



Proof: We are again given that our encoding of 1, $[0 \ge -1]$, is an axiom. Now, let any partial
assignment $\rho$ be given. Again, for any hypothesis $\varphi$ asserted in the proof, $\varphi|_\rho$ can be asserted from
the set of restrictions of hypotheses. Likewise, for each axiom, if $\rho$ assigns the variable a value,
then it simplifies to 1 (which is given as an axiom by assumption) and otherwise it remains an
assertion of the same axiom, so in either case it may still be asserted as an axiom. It thus remains
to consider formulas derived by our four inference rules.

We thus consider any formula $\varphi$ derived in the proof that is not witnessed to evaluate to true
in $\rho$. If it was derived from a formula $\psi$ by weakening, we note that if $\psi|_\rho = 1$ (i.e., was witnessed
to evaluate to true), then since $\varphi$ is the sum of $\psi$ and another inequality $\xi$ that is witnessed to
evaluate to true, we would have $\varphi|_\rho = 1$ also, but it is not by assumption. Therefore also $\psi|_\rho \ne 1$.
Furthermore, by Proposition [27], $\xi|_\rho$ is (also) witnessed true on every further partial assignment.
Therefore, $\varphi|_\rho = (\psi + \xi)|_\rho$ follows from $\psi|_\rho$ by weakening (with $\xi|_\rho$). Similarly, if $\varphi$ was derived by
addition of $\psi$ and $\xi$, at least one of $\psi$ and $\xi$ must not be witnessed to evaluate to 1 under $\rho$; WLOG
suppose it is $\psi$. Then if $\xi|_\rho = 1$, $\varphi|_\rho$ again follows from $\psi|_\rho$ by weakening. Finally, if neither $\psi$ nor
$\xi$ is witnessed to evaluate to true under $\rho$, we can derive $\varphi|_\rho$ from $\psi|_\rho$ and $\xi|_\rho$ by addition.

Multiplication is especially simple: we note that if $\varphi = [\sum_{i=1}^k (d \cdot c_i) x_i \ge d \cdot b]$ is derived
by multiplication from $\psi = [\sum_{i=1}^k c_i x_i \ge b]$, then $\psi$ also follows from $\varphi$ by division, and hence
$\psi|_\rho = 1$ iff $\varphi|_\rho = 1$ in this case; as we have assumed $\varphi|_\rho \ne 1$, we note that we can derive $\varphi|_\rho$
from $\psi|_\rho$ by multiplication by the same $d$. Finally, if $\varphi = [\sum_{i=1}^k c_i x_i \ge \lceil b/d \rceil]$ was derived from
$\psi = [\sum_{i=1}^k (d \cdot c_i) x_i \ge b]$ by division, we note (more carefully) that if $\psi|_\rho = 1$, then this means that
$\sum_{i:\rho_i=*} \min\{0, d \cdot c_i\} \ge b - \sum_{i:\rho_i=1} d \cdot c_i$; dividing by $d$, and noting that
$\sum_{i:\rho_i=*} \min\{0, c_i\} + \sum_{i:\rho_i=1} c_i$ is an integer, we hence also have
$\sum_{i:\rho_i=*} \min\{0, c_i\} \ge \lceil b/d \rceil - \sum_{i:\rho_i=1} c_i$,
so $\varphi$ would also be witnessed to evaluate to true, but we have assumed it does not. Now, we note
that

$$\psi|_\rho = \left[\sum_{i:\rho_i=*} (d \cdot c_i) x_i \ge b - \sum_{i:\rho_i=1} (d \cdot c_i)\right]$$

where division by $d$ therefore yields

$$\left[\sum_{i:\rho_i=*} c_i x_i \ge \left\lceil \frac{b}{d} - \sum_{i:\rho_i=1} c_i \right\rceil\right] = \left[\sum_{i:\rho_i=*} c_i x_i \ge \left\lceil \frac{b}{d} \right\rceil - \sum_{i:\rho_i=1} c_i\right] = \varphi|_\rho$$

as $\sum_{i:\rho_i=1} c_i$ is an integer.



4.3.1 Efficient algorithms for sparse, $\ell_1$-bounded cutting planes

We now turn to developing a syntactic restriction of cutting planes that features an efficient limited 
decision algorithm. 



Sparse cutting planes. The main restriction we use is to limit the number of variables appearing
in the threshold expression: we say that the formula is $w$-sparse if at most $w$ variables appear in
the sum.² Naturally, we say that a cutting planes proof is $w$-sparse if every formula appearing in
the proof is $w$-sparse.

² Naturally, this is a direct analogue of width in resolution; the reason we do not refer to it as "width" is that in the
geometric setting of cutting planes, width strongly suggests a geometric interpretation that would be inappropriate.

$\ell_1$-bounded coefficients. We will also use a restriction on the magnitude of the (integer)
coefficients. Given a formula of cutting planes, $\varphi = [\sum_{i=1}^k c_i x_i \ge b]$, we define the $\ell_1$-norm of $\varphi$ (denoted
$\|\varphi\|_1$) to be $|b| + \sum_{i=1}^k |c_i|$, i.e., the $\ell_1$ norm of the coefficient vector. For $L \in \mathbb{N}$, we naturally say
that a cutting planes proof is $L$-bounded if every $\varphi$ appearing in the proof has $\|\varphi\|_1 \le L$.

We remark that the natural simulation of width-$w$ resolution by cutting planes yields $w$-sparse
and $2w$-bounded proofs: intuitively, we wish to encode a clause $C = \ell_1 \vee \cdots \vee \ell_k$ by the linear
inequality

$$\sum_{i : \ell_i = x_i} x_i + \sum_{j : \ell_j = \neg x_j} (1 - x_j) \ge 1$$

which naturally corresponds to the cutting planes formula

$$\left[\sum_{i : \ell_i = x_i} x_i + \sum_{j : \ell_j = \neg x_j} (-1) x_j \ge 1 - |\{i : \ell_i \text{ negative}\}|\right]$$

in which, if $k \le w$, the coefficients from the LHS contribute at most $w$ to the $\ell_1$-norm, and the
threshold is easily seen to contribute at most $w$ (assuming $w \ge 1$). So, a simultaneously sparse and
$\ell_1$-bounded restriction of cutting planes generalizes the width-bounded restriction of resolution.
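As a small illustration of this encoding and of the two measures, consider the following sketch. The representation (a clause as a list of `(variable, is_positive)` pairs over distinct variables, an inequality as a pair `(coeffs, b)`) and the function names are assumptions chosen for the example.

```python
def encode_clause(clause):
    """Encode a clause, given as (variable, is_positive) literals, as [sum c_i x_i >= b]."""
    coeffs = {x: (1 if positive else -1) for x, positive in clause}
    negatives = sum(1 for _, positive in clause if not positive)
    return coeffs, 1 - negatives

def sparsity(phi):
    """Number of variables appearing in the sum."""
    return len(phi[0])

def l1_norm(phi):
    """|b| plus the sum of the absolute values of the coefficients."""
    coeffs, b = phi
    return abs(b) + sum(abs(c) for c in coeffs.values())

# The width-3 clause x1 OR (NOT x2) OR (NOT x3) becomes x1 - x2 - x3 >= -1.
phi = encode_clause([('x1', True), ('x2', False), ('x3', False)])
```

For this clause the encoding is 3-sparse with $\ell_1$-norm 4, within the $2w = 6$ bound for $w = 3$.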

We furthermore need to know that this special case of cutting planes is restriction-closed;
note that other natural special cases (e.g., bounding the sizes of individual coefficients) may not be.
Nevertheless, for the $\ell_1$-bounded cutting planes, this is easily established:

Proposition 29 The class of $L$-bounded $w$-sparse cutting plane proofs is restriction-closed for any
$L, w \in \mathbb{N}$.



Proof: Let any $L$-bounded $w$-sparse cutting plane proof $\Pi$ and partial assignment $\rho$ be given.
We consider the proof $\Pi|_\rho$ obtained by restricting every step of $\Pi$ by $\rho$ (shown to be a cutting
planes proof in Proposition [28]). Now, we note that in this proof, our encoding of 1 as $[0 \ge -1]$ is
0-sparse and 1-bounded, so it is guaranteed to be $L$-bounded and $w$-sparse. More generally, given
any $\varphi = [\sum_{i=1}^k c_i x_i \ge b]$ that is $L$-bounded and $w$-sparse,

$$\varphi|_\rho = \left[\sum_{i:\rho_i=*} c_i x_i \ge b - \sum_{i:\rho_i=1} c_i\right]$$

has $\ell_1$-norm

$$\|\varphi|_\rho\|_1 = \sum_{i:\rho_i=*} |c_i| + \left|b - \sum_{i:\rho_i=1} c_i\right| \le |b| + \sum_{i:\rho_i=*} |c_i| + \sum_{i:\rho_i=1} |c_i|$$

by the triangle inequality; as furthermore $\sum_{i:\rho_i \ne 0} |c_i| \le \sum_{i=1}^k |c_i|$, we conclude that

$$\|\varphi|_\rho\|_1 \le \|\varphi\|_1 \le L,$$

so $\Pi|_\rho$ is also $L$-bounded. Similarly, since every variable appearing in $\varphi|_\rho$ appears in $\varphi$ and the $\varphi$
appearing in $\Pi$ are assumed to be $w$-sparse, the $\varphi|_\rho$ appearing in $\Pi|_\rho$ are also $w$-sparse. Thus, $\Pi|_\rho$ is
also a $w$-sparse cutting planes proof, as needed. ■

We now consider an analogue of Algorithm [3] (i.e., a simple dynamic programming
algorithm) for the limited decision problem for $w$-sparse and $L$-bounded cutting planes.

Theorem 30 (Analysis of decision algorithm for sparse, bounded cutting planes) For any
$w, L \in \mathbb{N}$, this algorithm solves the limited decision problem for $w$-sparse $L$-bounded cutting planes.
It runs in time $O((w + \max_i\{|\varphi_i|\}) L (Ln)^w (L(Ln)^w + \ell)^2)$ (which, for $w$ constant and $w$-sparse
$L$-bounded $\varphi_i$, is $O(L^3 (Ln)^{3w})$) where $n$ is the number of variables.

Proof: The analysis is very similar to our previous dynamic programming algorithms for
bounded-width RES($k$), Theorem [20]. As there, we are inductively guaranteed that at each stage we set $T[\psi]$
to 1 only when there is a $w$-sparse $L$-bounded proof of $\psi$, and conversely, for every $\psi$ with a
$w$-sparse $L$-bounded proof, until $T[\psi]$ is set to 1, on each iteration of the main loop, we set an entry
of $T$ to 1 for some new step of the proof (we noted that weakening could be simulated by repeated
addition of axioms, so we don't need to consider it explicitly). Thus, if the input target $\varphi$ has a
$w$-sparse $L$-bounded proof, $T[\varphi]$ would be set to 1 at some point, whereupon the algorithm accepts,
and otherwise since the size of the table is bounded, the algorithm eventually cannot add more
formulas to the table and so rejects. It only remains to consider the running time.

The main observation is that there are at most $O(L^{w+1})$ ways of assigning integer
weights of total $\ell_1$-weight at most $L$ to the $w$ nonzero coefficients and the threshold; therefore,
as there are at most $O(n^w)$ distinct choices of up to $w$ variables, there are at most $O(L^{w+1} n^w)$
possible $w$-sparse $L$-bounded cutting plane formulas. At least one is added on each iteration of
the loop, and each iteration considers every pair of such formulas with the $\ell$ input formulas (for
$O((L(Ln)^w + \ell)^2)$ pairs on each iteration), where this sum can be carried out and checked in
$O(w + \max_i\{|\varphi_i|\})$ arithmetic operations; checking the $O(L)$ possible multiples and divisors for each
of the $O(L(Ln)^w)$ formulas in $T$ also takes $O(w)$ arithmetic operations each, so the time for adding
pairs dominates. The claimed running time is now immediate. ■
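The table-filling argument can be illustrated by a naive saturation procedure. This is only a sketch in the spirit of the analysis above and not the algorithm analyzed in the theorem: it assumes the hypotheses are themselves $w$-sparse and $L$-bounded, it makes no attempt at the stated running time, and the names `make` and `decide` are hypothetical. Formulas are stored in a hashable canonical form produced by `make`.

```python
def make(coeffs, b):
    """Canonical hashable form of [sum c_i x_i >= b]: sorted nonzero coefficients and b."""
    return (tuple(sorted((x, c) for x, c in coeffs.items() if c != 0)), b)

def decide(hyps, target, variables, w, L):
    """Saturate the set of w-sparse, L-bounded formulas derivable from hyps and the axioms."""
    def ok(phi):
        coeffs, b = phi
        return len(coeffs) <= w and abs(b) + sum(abs(c) for _, c in coeffs) <= L

    table = {make({}, -1)}                               # the encoding of 1: [0 >= -1]
    for x in variables:
        table |= {make({x: 1}, 0), make({x: -1}, -1)}    # x >= 0 and -x >= -1
    table = {phi for phi in table | set(hyps) if ok(phi)}
    changed = True
    while changed and target not in table:
        new = set()
        for phi in table:
            c1, b1 = dict(phi[0]), phi[1]
            for psi in table:                            # addition
                c2 = dict(psi[0])
                total = {x: c1.get(x, 0) + c2.get(x, 0) for x in set(c1) | set(c2)}
                new.add(make(total, b1 + psi[1]))
            for d in range(2, L + 1):                    # multiplication and division
                new.add(make({x: d * c for x, c in c1.items()}, d * b1))
                if all(c % d == 0 for c in c1.values()):
                    new.add(make({x: c // d for x, c in c1.items()}, -((-b1) // d)))
        new = {phi for phi in new if ok(phi)} - table
        changed = bool(new)
        table |= new
    return target in table
```

Because only finitely many formulas pass the `ok` filter, the loop must stop: either the target enters the table or an iteration adds nothing new.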

Once again, we are in a position to apply Theorem [13] and thus obtain:

Corollary 31 (Implicit learning in sparse bounded cutting planes) Let a list of $w$-sparse
$L$-bounded cutting planes formulas $\varphi_1, \ldots, \varphi_\ell$ and $\varphi$ be given, and suppose that partial assignments
are drawn from a masking process for an underlying distribution $D$; suppose further that either

1. There exists some list of cutting planes formulas $\psi_1, \ldots, \psi_k$ such that partial assignments from
the masking process are witnessed to satisfy $\psi_1, \ldots, \psi_k$ with probability at least $(1 - \epsilon + \gamma)$
and there is a $w$-sparse $L$-bounded cutting planes derivation of $\varphi$ from $\varphi_1, \ldots, \varphi_\ell, \psi_1, \ldots, \psi_k$
or else

2. $[\varphi_1 \wedge \cdots \wedge \varphi_\ell \Rightarrow \varphi]$ is at most $(1 - \epsilon - \gamma)$-valid with respect to $D$ for $\gamma > 0$.

Then, there is an algorithm running in time $O(\frac{w + \max_i\{|\varphi_i|\}}{\gamma^2} L(Ln)^w (L(Ln)^w + \ell)^2 \log\frac{1}{\delta})$ (given unit cost
arithmetic operations) that distinguishes these cases with probability $1 - \delta$ when given $\varphi$, $\varphi_1, \ldots, \varphi_\ell$,
$\epsilon$, $\gamma$, and a sample of $O(\frac{1}{\gamma^2} \log\frac{1}{\delta})$ partial assignments.

5 The utility of knowledge with imperfect validity 

Although our introduction of PAC-Semantics was primarily motivated by our need for a weaker
guarantee that could be feasibly satisfied by inductive learning algorithms, it turns out to provide
a windfall from the standpoint of several other classic issues in artificial intelligence. Several such
examples are discussed by Valiant [50];³ we will dwell on two core, related problems here, the
frame and qualification problems, first discussed by McCarthy and Hayes [39]. The frame problem
essentially concerns the efficient representation of what changes - and what doesn't - as the result
of an action (stressed in this form by Raphael [S]). The traditional solutions to this problem -
first suggested by Sandewall [48], with a variety of subsequent formalizations including notably,
McCarthy's circumscription [HTJ [38] and Reiter's defaults [35] and "successor state axioms" [36]
- all essentially are (informally) captured by asserting in one way or another that (normally)
"nothing changes unless an action that changes it is taken." Putting the early methods such as
circumscription and defaults aside (which have their own issues, cf. Hanks and McDermott's "Yale
shooting problem" [23]), the other approaches make the above assertion explicit, and thus encounter
some form of the qualification problem: that is, it is essentially impossible to assert the full variety
of reasons for and ways in which something could change or fail to change in a real-world situation.

³ Concerning a related, but slightly different framework: there, "unspecified" is taken to be a third value, on par
with "true" and "false," which may be treated specially in reasoning.

Thus, the successor state axioms (etc.) fully capture a toy domain at best. And yet, such
simplified models have been shown to be useful in the design of algorithms for planning: implicitly
in early work such as Fikes and Nilsson's STRIPS [19], and more explicitly in later work such
as Chapman's "modal truth criterion" in his work on partial-order planning [9] and as explicit
constraints in planning as propositional satisfiability by Kautz and Selman [26, 27]. Indeed, such
approaches "solve the problem" in the sense that the kinds of plans generated by such systems are
intuitively reasonable and correspond to what is desired.

More to the point, we can take the stance that such assumptions are merely approximations 
to the real-world situation that may fail for various unanticipated reasons, and so while the plans
generated on their basis may likewise fail for unanticipated reasons, this does not detract from 
the utility of the plans under ordinary circumstances. Indeed, supposing we take a discrete-time 
probabilistic (e.g., Markovian) model of the evolution of the world, we might reasonably expect 
that if we consider the marginal distribution over successive world states, that formulas such as 
the successor state axioms would be (1 — e)-valid with respect to this distribution for some small 
(but nonzero) e. Of course, this view of the solutions to the frame problem is not novel to this 
work, and it has been expressed since the earliest works on probabilistic models in planning [15, 21].
The point is rather that such examples of what are effectively (1 — e)-valid rules arise naturally in 
applications, and we claim that just as PAC-Semantics captures the sense in which learned rules 
are (approximately) "true," PAC-Semantics also captures the sense in which these approximate 
rules (e.g., as used in planning) are "true." 

6 Directions for future work 

A broad possible direction for future work involves the development of algorithms for reasoning 
in PAC-Semantics directly, that is, not obtained by applying Theorem [13] to algorithms for the 
limited decision problems under the classical (worst-case) semantics of the proof systems. We will 
give some concrete suggestions for how this might be pursued below. 

6.1 Incorporating explicit learning 

One approach concerns the architecture of modern algorithms for deciding satisfiability; a well- 
known result due to Beame et al. [5] establishes that these algorithms effectively perform a search 
for resolution proofs of unsatisfiability (or, satisfying assignments), and work by Atserias et al. [3] 
shows that these algorithms (when they make certain choices at random) are effective for deciding
bounded-width resolution.

The overall architecture of these modern "SAT-solvers" largely follows that of Zhang et al. [52],
and is based on improvements to DPLL [14, 13] explored earlier in several other works [36} [U [22] .
Roughly speaking, the algorithm makes an arbitrary assignment to an unassigned variable, and then 
examines what other variables must be set in order to satisfy the formula; when a contradiction is 
entailed by the algorithm's decision, a new clause is added to the formula (entailed by the existing 
clauses) and the search continues on a different setting of the variables. A few simple rules are 
used for the task of exploring the consequences of a partial setting of the variables — notably, for 
example, unit propagation: whenever all of the literals in a clause are set to false except for one 
(unset) variable, that final remaining literal must be set to true if the assignment is to satisfy the 
formula. 
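Unit propagation as just described is easy to state in code. The sketch below assumes DIMACS-style clauses (lists of nonzero integers, where a negative integer denotes a negated variable) and a dict from variables to truth values; these conventions and the name `unit_propagate` are choices made for this illustration, not details of any particular solver.

```python
def unit_propagate(clauses, assignment):
    """Extend a partial assignment by unit propagation.

    clauses: list of lists of nonzero ints; assignment: dict variable -> bool.
    Returns the extended assignment, or None if some clause has every literal false.
    """
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
                continue                          # clause already satisfied
            unset = [lit for lit in clause if abs(lit) not in assignment]
            if not unset:
                return None                       # conflict: all literals are false
            if len(unset) == 1:
                # the single remaining literal must be set to true
                assignment[abs(unset[0])] = unset[0] > 0
                changed = True
    return assignment

# Deciding x2 = False forces x1 = True via (x1 OR x2), then x3 = True via (NOT x1 OR x3).
result = unit_propagate([[1, 2], [-1, 3], [-3, -2]], {2: False})
```

A return value of `None` corresponds to the situation in which a contradiction is entailed by the decisions made so far.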

One possibility for improving the power of such algorithms for reasoning under PAC-Semantics 
using examples is that one might wish to use an explicit learning algorithm such as WINNOW [M] 
to learn additional (approximately valid) rules for extending partial assignments. If we are using 
these algorithms to find resolution refutations, then when a refutation was produced by such a 
modified architecture, it would establish that the input formula is only satisfied with some low 
probability (depending on the error of the learned rules that were actually invoked during the 
algorithm's run). 

Given such a modification, one must then ask: does it actually improve the power of such 
algorithms? Work by Pipatsrisawat and Darwiche [13] (related to the above work) has shown 
that with appropriate (nondeterministic) guidance in the algorithm's decisions, such algorithms do 
actually find arbitrary (i.e., DAG-like) resolution proofs in a polynomial number of iterations. Yet, 
it is still not known whether or not a feasible decision strategy can match this. Nevertheless, their 
work (together with the work of Atserias et al. [3]) provides a potential starting point for such an 
analysis. 

6.1.1 A suggestion for empirical work 

Another obvious direction for future work is the development and tuning of real systems for inference 
in PAC-Semantics. While the algorithms we have presented here illustrate that such inference can 
be theoretically rather efficient and are evocative of how one might approach the design of a real- 
world algorithm, the fact is that (1) any off-the-shelf SAT solver can be easily modified to serve this 
purpose and (2) SAT solvers have been highly optimized by years of effort. It would be far easier 
and more sensible for a group with an existing SAT solver implementation to simply make the 
following modification, and see what the results are: along the lines of Algorithm [2], for a sample
of partial assignments $\{\rho^{(1)}, \ldots, \rho^{(m)}\}$, the algorithm loops over $i = 1, \ldots, m$, taking the unmasked
variables in $\rho^{(i)}$ as decisions and checking for satisfiability with respect to the remaining variables.
Counting the fraction of the partial assignments that can be extended to satisfying assignments 
then gives a bound on the validity of the input formula. Crucially, in this approach, learned clauses 
are shared across samples. Given that there is a common resolution proof across instances (cf. 
the connection between SAT solvers and resolution [5]) we would expect this sharing to lead to a 
faster running time than simply running the SAT solver as a black box on the formulas obtained 
by "plugging in" the partial assignments (although that is another approach). 
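The counting loop suggested above might be organized as in the following sketch. It is deliberately naive: `extendable` is a brute-force stand-in for a call to a SAT solver and shares no learned clauses across samples, so it shows only the bookkeeping, not the anticipated speedup. Clauses are DIMACS-style lists of nonzero integers over variables `1..n`, and all names here are hypothetical.

```python
from itertools import product

def extendable(clauses, rho, n):
    """Brute-force stand-in for a SAT call: can rho be extended to a satisfying assignment?"""
    free = [v for v in range(1, n + 1) if v not in rho]
    for bits in product([False, True], repeat=len(free)):
        full = {**rho, **dict(zip(free, bits))}
        if all(any(full[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses):
            return True
    return False

def fraction_extendable(clauses, samples, n):
    """Fraction of the sampled partial assignments that extend to satisfying assignments."""
    return sum(extendable(clauses, rho, n) for rho in samples) / len(samples)

# (x1 OR x2) AND (NOT x1 OR NOT x2) on three sampled partial assignments.
samples = [{1: True}, {1: True, 2: True}, {}]
fraction = fraction_extendable([[1, 2], [-1, -2]], samples, 2)
```

In a real implementation the body of `extendable` would be replaced by a solver call that takes the unmasked variables of each sample as decisions while retaining the learned clauses from earlier samples.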

6.2 Exploiting limited kinds of masking processes

Another direction for possibly making more sophisticated use of the examples in reasoning under 
PAC-Semantics involves restricting the masking processes. In the pursuit of reasoning algorithms, 
it might be helpful to consider restrictions that allow some possibility of "extrapolating" from the 
values of variables seen on one example to the values of hidden variables in other examples (which 
is not possible in general since the masking process is allowed to "see" the example before choosing 
which entries to mask). For example, if the masks were chosen independently of the underlying 
examples, this might enable such guessing to be useful. 

6.3 Relating implicit learning to query-driven explicit learning 

A final question that is raised by this work is whether or not it might be possible to extend the
algorithm used in Theorem [13], Algorithm [1], to produce an explicit proof from an explicit set of
formulas that are satisfied with high probability from, e.g., algorithms for finding treelike resolution
proofs even when the CNF we need is not perfectly valid. Although this is a somewhat ambitious
goal, if one takes Algorithm [1] as a starting point, the problem is of a similar form to one considered
by Dvir et al. [16]; there, they considered learning decision trees from restrictions of the target
tree. The main catch here is that in contrast to their setting, we are not guaranteed that we find
restrictions of the same underlying proof, even when one is assumed to exist.

Appendix

A The necessity of computationally feasible witnessing 

We now show that it is necessary for our implicit learning problem that any notion of witnessing
we use possess some kind of efficient algorithm. Broadly speaking, we are supposing that we use
some class of "axiom" formulas $A$ such that whenever the collection of axioms $\{\alpha_1, \ldots, \alpha_k\} \subseteq A$
satisfy our candidate witnessing property $W$ (given as a relation over, say, formulas and partial
assignments) under the masking process with probability $(1 - \epsilon)$ (guaranteeing that $\alpha_1 \wedge \cdots \wedge \alpha_k$
is $(1 - \epsilon)$-valid for the underlying distribution $D$), and there exists a proof $\Pi$ of the query $\varphi$ in
the limited set $S$ from the set of hypotheses $\{\alpha_1, \ldots, \alpha_k\}$, then the algorithm certifies the
$(1 - \epsilon)$-validity of the query $\varphi$ under $D$. Now, in general, we would expect that in any "reasonable" proof
system and class of "simple" proofs $S$, the hypotheses should have trivial proofs (namely, they can
be asserted immediately) and therefore the efficient algorithm we are seeking should certify the
$(1 - \epsilon)$-validity of any member of $A$ whenever the property $W$ holds for the masking process with
probability $(1 - \epsilon)$. (We will repeat this argument slightly more formally in Proposition [32] below.)

In summary, this means precisely that for such a collection $A$, there is an algorithm such
that on input $\alpha \in A$ (and $\delta, \gamma > 0$) and given an oracle for examples, for any distribution over
masked examples given by a masking process applied to a distribution over scenes $M(D)$, with
probability at least $1 - \delta$ the algorithm correctly decides whether $\Pr_{\rho \in M(D)}[W(\alpha, \rho)] \ge 1 - \epsilon + \gamma$
or $\Pr_{x \in D}[\alpha(x) = 0] > \epsilon + \gamma$ (given that one of these cases holds) in time polynomial in the size
of the domain, $1/\gamma$, $\log 1/\delta$, $\log 1/\epsilon$, and the size of $\alpha$. We refer to this algorithm as an efficient
PAC-Certification of $W$ for $A$, and it serves as a kind of efficient evaluation algorithm for $W$.

We now restate these observations more formally: any notion of "witnessing" underlying an 
implicit learning algorithm in the style of Theorem [13] must be efficiently evaluable on partial 
assignments and therefore also verifiable from examples. 

Proposition 32 (Witnessing of axioms must be computationally feasible) Let $S$ be a set
of proofs for a proof system such that any explicit hypothesis has a proof in $S$. Let $A$ be a set of
formulas and $W$ be a property of formulas.

Suppose that there is a probabilistic algorithm running in time polynomial in the number of
variables $n$, the size of the query and set of hypotheses, $1/\gamma$, and the number of bits of precision
of the parameters $\epsilon$ and $\delta$ with the following behavior: given a query formula $\varphi$, $\epsilon, \delta, \gamma \in (0,1)$,
query access to example partial assignments from a masking process $M$ over a distribution over
assignments $D$, and a list of hypothesis formulas $H$, it distinguishes

• queries $\varphi$ such that $[H \Rightarrow \varphi]$ is not $(1 - \epsilon - \gamma)$-valid under $D$ from

• queries that have a proof in $S$ from $H' = H \cup A'$ for some $A' \subseteq A$ such that
$\Pr_{\rho \in M(D)}[\forall \alpha \in A' \; W(\alpha, \rho)] \ge 1 - \epsilon + \gamma$.

Then there is a probabilistic polynomial time algorithm that on input $\alpha \in A$ and $\rho$ distinguishes
pairs for which $W$ holds from pairs for which there is some $x$ consistent with $\rho$ such that $\alpha(x) = 0$.

Moreover, for $\{\alpha_1, \ldots, \alpha_k\}$ and an oracle for examples from some distribution over partial
assignments $M(D)$, we can distinguish

$$\Pr_{\rho \in M(D)}[W(\alpha_1, \rho) \wedge \cdots \wedge W(\alpha_k, \rho)] \ge 1 - \epsilon + \gamma$$

from cases where $\alpha_1 \wedge \cdots \wedge \alpha_k$ is not $(1 - \epsilon - \gamma)$-valid with probability $1 - \delta$ in time polynomial in
$1/\gamma$, $\log 1/\epsilon$, $\log 1/\delta$, the size of the domain, and the size of $\alpha_1 \wedge \cdots \wedge \alpha_k$.

Proof: We will first argue that $W$ has efficient PAC-Certification for $A$. Following the argument
sketched above, let any $\alpha \in A$ and $\epsilon, \delta, \gamma \in (0,1)$ be given. We then simply run our hypothetical
algorithm with query $\alpha$ and $H$ empty. We know that this algorithm then runs in time polynomial
in $|\alpha|$, $1/\gamma$, $\log 1/\delta$, and $\log 1/\epsilon$. Furthermore, if $\alpha$ is not $(1 - \epsilon - \gamma)$-valid (i.e.,
$\Pr_{x \in D}[\alpha(x) = 0] > \epsilon + \gamma$), then we know the algorithm must detect this with probability $1 - \delta$. Likewise, if $\alpha$ satisfies
$\Pr_{\rho \in M(D)}[W(\alpha, \rho)] \ge 1 - \epsilon + \gamma$, then for $A' = \{\alpha\}$, there is a proof of $\alpha$ from $A'$ in $S$ and our
algorithm is guaranteed to recognize that we are in the second case with probability $1 - \delta$. So we
see that the efficient PAC-Certification of $W$ for $A$ is immediate.

Let any partial assignment $\rho$ be given, and consider the family of point distributions $D_y$ for $y$
consistent with $\rho$ with the masking process $M$ that obscures precisely the entries hidden in $\rho$. Then
for every such $y$, the distribution $M(D_y)$ is a point distribution that produces $\rho$ with probability
1. Consider the behavior of the algorithm for efficient PAC-Certification of $W$ for $A$ given access
to such a distribution (which is trivially simulated given $\rho$) with, say, $\epsilon = 1/2$, $\gamma = 1/4$.

Suppose that $\rho$ is consistent with some $y$ for which $\alpha(y) = 0$. Then in such a case,
$\Pr_{x \in D_y}[\alpha(x) = 0] = 1 > \epsilon + \gamma$, so when given examples from $M(D_y)$ (and hence, when given $\rho$ as every example) the
algorithm must decide that the second case holds. Now, suppose on the other hand that $W(\alpha, \rho)$
holds; then since our distribution produces $\rho$ with probability 1, the algorithm must decide the first
case holds. Thus, our modified algorithm is as needed for the first part.

For the second part, we note that running the algorithm from the first part on each formula $\alpha_i$ and
each partial assignment from a sample of size $O(1/\gamma^2 \log 1/\delta)$, and checking whether the fraction
of times $W$ was decided to hold for all $k$ formulas exceeded $1 - \epsilon$ suffices to distinguish the two
cases by the usual concentration bounds. ■
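The sampling step in the second part can be sketched as follows. The constant in `sample_size` comes from the two-sided Hoeffding bound, which is one standard instance of the "usual concentration bounds" and is our choice rather than the source's; the names `sample_size` and `certify` and the `witnessed` callback are hypothetical.

```python
import math

def sample_size(gamma, delta):
    """Hoeffding: m >= ln(2/delta) / (2 gamma^2) samples put the empirical
    fraction within gamma of its expectation with probability at least 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * gamma ** 2))

def certify(witnessed, axioms, samples, eps):
    """Check whether the fraction of samples witnessing every axiom exceeds 1 - eps.

    witnessed(alpha, rho) plays the role of the per-example test from the first part.
    """
    hits = sum(all(witnessed(alpha, rho) for alpha in axioms) for rho in samples)
    return hits / len(samples) > 1 - eps

# Example witnessing test for clauses: some literal of the clause is set to true in rho.
def clause_witnessed(clause, rho):
    return any(rho.get(abs(lit)) == (lit > 0) for lit in clause)

accepted = certify(clause_witnessed, [[1, 2]], [{1: True}, {2: True}, {}, {1: False}], 0.6)
```

With a gap of $2\gamma$ between the two cases, an empirical fraction that is accurate to within $\gamma$ falls on the correct side of the threshold $1 - \epsilon$.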

Our notion of witnessed values is clearly one that suffices for any family of axioms A. By 
contrast, we now see that, for example, we cannot in general take $W$ to be the collection of pairs
($\alpha$, $\rho$) such that for every $x$ consistent with $\rho$, $\alpha(x) = 1$ - arguably, the most natural candidate (and
in particular, the notion originally used by Michael [ID]) - since this may be NP-complete, e.g., for 
3-DNF formulas, and so is presumably not feasible to check. (We remark that our notion actually 
coincides with this one in the case of CNF formulas, which is the relevant class of formulas for the 
resolution proof system.) 

B On the analysis of the algorithm for bounded-space treelike 
resolution 

We note that we can associate an optimal clause space to a given derivation using the following 
recurrence (often used to define the equivalent pebble number of a tree): 

Proposition 33 The optimal space derivation for a treelike resolution proof corresponding to a
given tree can be obtained recursively as follows:

• The space of a single node is 1.

• The space of the root of a tree with two subtrees both derivable in space $s$ is $s + 1$.

• The space of the root of a tree with subtrees derivable in space $s$ and $s'$, where $s > s'$, is $s$.
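
The recurrence above can be evaluated directly on a tree. The following is a minimal sketch, assuming a proof tree is encoded as `None` for a source (axiom) and as a pair `(left, right)` for a cut step; both this encoding and the name `clause_space` are illustrative choices, not notation from the text.

```python
def clause_space(tree):
    """Evaluate the recurrence of Proposition 33 on a binary tree.

    `tree` is None for a single node (a source) or a pair (left, right)
    of subtrees for an internal node.
    """
    if tree is None:
        return 1                      # a single node needs one clause
    left, right = tree
    a, b = clause_space(left), clause_space(right)
    # Equal subtree costs force one extra clause; otherwise the larger wins.
    return a + 1 if a == b else max(a, b)

# Usage: a path-like tree stays at space 2, a balanced tree of four leaves needs 3.
leaf = None
caterpillar = (((leaf, leaf), leaf), leaf)
balanced = ((leaf, leaf), (leaf, leaf))
print(clause_space(caterpillar), clause_space(balanced))
```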

Proof: We proceed by induction on the structure of the tree. A proof of a clause
must assert that clause in the final step, so any proof must use one clause's worth of space (which
is attained for the sources, i.e., the axioms, of the proof). Furthermore, for any node of
a tree, given that the formula holds for the subtrees below that node, the formula continues to
hold: if one subtree requires more space than the other, we can derive the clause labeling the root
of the former tree in space $s$, and retaining that clause on the blackboard, we can carry out the
space-$s'$ derivation for the other subtree on the blackboard utilizing total space $s' + 1 \leq s$. This
derivation is optimal since the proof derives the clauses labeling the roots of both subtrees, and
therefore it requires at least as much space as the derivation of either subtree.

If the subtrees both require space $s$, then using a derivation similar to the one described above
(for the subtrees in arbitrary order) gives a space $s + 1$ derivation of the root. To see that this
is optimal, we first note that if the blackboard is ever empty during a resolution proof, we could
eliminate any steps prior to the step with the empty blackboard, and still obtain a legal proof,
so we assume WLOG that the derivation, when restricted to either of the subtrees, always includes
at least one clause. We next note that in any derivation of one of the subtrees, by the induction
hypothesis, there must be some blackboard configuration that contains $s$ clauses. If this occurs
during a derivation of the other subtree in the overall derivation, then the overall derivation uses
at least $s + 1$ space. If it does not, then the conclusion of this derivation (the root of the subtree)
must remain on the blackboard for use in the final step of the proof; therefore, at a configuration of
the blackboard in the derivation of the other subtree with at least $s$ clauses, at least $s + 1$ clauses
appear on the blackboard in the overall derivation. ■

Actually, Ansótegui et al. [2] refer to the clause space for treelike resolution as the
Horton-Strahler number after the discoverers of the corresponding combinatorial parameter on trees
[25, 49] (which again happens to be essentially the same as the "pebble number" of the tree). The
algorithm for efficient proof search (SearchSpace, Algorithm 2) was, to the best of our knowledge,
first essentially discovered as an algorithm for learning decision trees (of low pebble number) by
Ehrenfeucht and Haussler [17] (we remark that the connection between treelike resolution and
decision trees is an old bit of folklore, first appearing in the literature in a work by Lovász et
al. [35]) and rediscovered in the context of resolution by Kullmann [33]; the algorithm used by
Beame and Pitassi [6] is also essentially similar, although they only considered the resulting proof
tree size (not its space).

Although the analysis of SearchSpace is, at its heart, a fairly straightforward recurrence, it 
requires some groundwork. We first note that whenever a bounded space treelike resolution proof 
exists, it can be converted into a (normal) form that can be discovered by SearchSpace: 

Definition 34 (Normal) We will say that a resolution proof is normal if in its corresponding 
DAG: 1. All outgoing edges from Cut nodes are directed to Cut nodes. 2. The clauses labeling any 
path from the sink to a Cut node contain literals using every variable along the path. 3. A given 
variable is used in at most one cut step and at most one weakening step along every path from a 
source to a Cut node. 

Proposition 35 For any space-$s$ treelike resolution proof $\Pi$ there is a normal space-$s$ treelike
resolution proof $\Pi'$.

Proof: First note that in general, we don't need to use weakening steps in the proof, except 
perhaps on some initial path from a source: all other occurrences can be eliminated by deleting 
the introduced literal along the path to the sink until either a node is encountered in which the 
other incoming edge is from a clause that also features that literal or which applies the cut rule 
on that variable, redirecting the edge on this path to the cut node past it towards the sink in the 
latter case (eliminating the other branch of the proof), and then finally replacing the weakening 
node with the node leading to it. This transformation does not increase the clause space of a proof 
and leaves a treelike proof treelike. 

Once the weakening steps have been removed (i.e., in the proof cut nodes only have outgoing 
edges to other cut nodes) we can see that on any path from the sink to any cut node, at most one 
literal is introduced at each step; in particular, the set of literals on the path leading to any cut 
node is a superset of the literals in the cut node. Note that we can obtain a proof of the same 
clause space in which the internal nodes are all labeled with the clauses consisting of these sets of 
literals, by adding some additional weakening steps between the sources of the proof and the first 
cut node. Since these steps leave these chains at clause space 1, the clause space is preserved, and 
a treelike proof is still treelike. 

Finally, to guarantee the third property, we show how to eliminate additional mentions of a 
variable. While the proof is not normal, identify some offending path. For the subtree rooted at 
the occurrence of the label closest to the source of this path, replace this subtree with its child 
subtree labeled with the same clause (note that one such subtree must exist since this literal is 
already mentioned in the clause). Note that the result is still a treelike resolution proof, and
moreover, since the child subtree has clause space no greater than the clause space of the original 
subtree, the clause space of the new proof cannot increase. ■ 

We now describe the proof of the theorem on SearchSpace, restated here.

Theorem 36 (SearchSpace finds space-$s$ treelike proofs when they exist) If there is a space-$s$
treelike proof of a clause $C$ from a CNF formula $\varphi$, then SearchSpace returns such a proof, and
otherwise it returns "none." In either case, it runs in time $O(|\varphi| \cdot n^{2(s-1)})$ where $n$ is the number
of variables.

Proof: Recalling Proposition 33, in any normal space-$s$ treelike derivation of a clause $C$, one of
the clauses involved in the final step must be derivable in space at most $s - 1$. It is therefore clear that
SearchSpace can find any normal space-$s$ treelike proof by tracing paths from the root, choosing a
literal labeling one of the clauses derivable in strictly smaller space first. By Proposition 35, this is
sufficient, and all that remains is to check the running time.

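Given $W$ work per each invocation of SearchSpace (i.e., ignoring its recursive calls, so $T(n, 1) \leq W$
for all $n$ and $T(1, s) \leq W$ for all $s$), the running time is described by the recurrence
$T(n, s) \leq T(n-1, s) + 2n\,T(n-1, s-1) + W$. We can verify (by induction on $n$ and $s$) that $W(n+1)^{2(s-1)}$
is a solution. Assuming the bound holds for $T(n-1, s)$ and $T(n-1, s-1)$ (for $n > 1$, $s > 1$):

\[ T(n, s) \leq W n^{2(s-1)} + 2n \cdot W n^{2(s-2)} + W = W\left(n^{2(s-1)} + 2n^{2s-3} + 1\right) \leq W(n+1)^{2(s-1)}, \]

where the last inequality holds because, for $s \geq 2$, the binomial expansion of $(n+1)^{2(s-1)}$ contains
the terms $n^{2(s-1)}$, $2(s-1)\,n^{2s-3}$, and $1$.

Noting that the first case can be checked in time $O(|\varphi|)$ (for $O(|\varphi|)$ work per node) gives the claimed
bound. ■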
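
To make the search strategy in this proof concrete, here is a minimal sketch of a space-bounded search of the kind described: it tries each literal, first looking for the side derivable in strictly smaller space, then giving the other side the full space budget. It is an illustration under assumptions, not a reproduction of Algorithm 2: the name `search_space`, the DIMACS-style encoding of literals as nonzero integers, and the tuple format of the returned proof are all hypothetical choices. The recursion makes at most $2n$ calls with budget $s - 1$ and at most one call with budget $s$, which matches the shape of the recurrence above.

```python
def search_space(cnf, clause, s):
    """Search for a treelike resolution derivation of `clause` from `cnf`
    in clause space at most s; return a nested tuple, or None if none exists.

    Clauses are frozensets of nonzero ints (-v is the negation of v).
    """
    # Space 1: the clause follows from a single axiom by weakening.
    for axiom in cnf:
        if axiom <= clause:
            return ("axiom", axiom)
    if s <= 1:
        return None
    unused = {abs(l) for ax in cnf for l in ax} - {abs(l) for l in clause}
    for v in sorted(unused):
        for lit in (v, -v):
            # Look first for the premise derivable in strictly smaller space.
            small = search_space(cnf, clause | {lit}, s - 1)
            if small is not None:
                # The other premise may then use the full space budget.
                big = search_space(cnf, clause | {-lit}, s)
                if big is None:
                    # By weakening, `clause` has no space-s derivation either.
                    return None
                return ("cut", v, small, big)
    return None

# Usage: from x1 and (not x1 or x2), derive x2; this needs space 2.
cnf = [frozenset({1}), frozenset({-1, 2})]
print(search_space(cnf, frozenset({2}), 1), search_space(cnf, frozenset({2}), 2))
```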

We now establish that the bounded-space algorithm efficiently finds treelike proofs; we first
recall the statement of the corresponding proposition.

Proposition 37 A treelike proof $\Pi$ can be carried out in clause space at most $\log_2 |\Pi| + 1$.

Proof: We proceed by induction on the structure of the DAG corresponding to $\Pi$. For a proof
consisting of a single node, the claim is trivial. Consider any treelike proof now; one of the children of
the root is the root of a subtree containing at most half of the nodes of the tree. By the induction
hypothesis, this derivation can be carried out in space at most $\log_2(|\Pi|/2) + 1 = \log_2 |\Pi|$, while the
other child can be derived in space at most $\log_2 |\Pi| + 1$. Therefore, by Proposition 33, there is a
derivation of the root in space at most $\log_2 |\Pi| + 1$. ■
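
As a sanity check on this bound, the sketch below enumerates every (node count, clause space) pair achievable by small trees and compares it with $\log_2 |\Pi| + 1$. It assumes proof trees whose internal nodes are all binary cut steps (weakening steps are ignored), and the names `shapes` and `bound_holds` are hypothetical.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def shapes(leaves):
    """All (node_count, clause_space) pairs over binary trees with the given
    number of leaves, using the recurrence for clause space on trees."""
    if leaves == 1:
        return frozenset({(1, 1)})
    out = set()
    for k in range(1, leaves):
        for n1, s1 in shapes(k):
            for n2, s2 in shapes(leaves - k):
                space = s1 + 1 if s1 == s2 else max(s1, s2)
                out.add((n1 + n2 + 1, space))
    return frozenset(out)

def bound_holds(max_leaves):
    """Check clause_space <= log2(node_count) + 1 for all small trees."""
    return all(space <= math.log2(nodes) + 1
               for m in range(1, max_leaves + 1)
               for nodes, space in shapes(m))

print(bound_holds(12), max(space for _, space in shapes(8)))
```

The largest space among trees with eight leaves is attained by the complete binary tree, which is the case where the bound is closest to tight.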

Algorithm 3: Pseudocode for Decide-RES($k$)-Width
input : List of $k$-DNF formulas $\varphi_1, \ldots, \varphi_\ell$, target width-$w$ $k$-DNF $\varphi$, width bound $w \in \mathbb{N}$.
output: Accept if there is a RES($k$) proof of $\varphi$ of width $w$; Reject otherwise.
begin
    Initialize a table $T[\psi] \leftarrow 0$ for every $k$-DNF $\psi$ of width at most $w$ and then set $T[\varphi_i] \leftarrow 1$
        for each $\varphi_i$ that is a width-$w$ $k$-DNF.
    $NEW \leftarrow 1$.
    while $NEW = 1$ do
        if $T[\varphi] = 1$ then
            return Accept
        $NEW \leftarrow 0$.
        foreach $k$-DNF $\psi_1$ of width at most $w$ with $T[\psi_1] = 1$ or among $\varphi_1, \ldots, \varphi_\ell$ do
            foreach formula $\psi'$ of width at most $w$ derivable from $\psi_1$ by weakening or
                    $\wedge$-elimination do
                if $T[\psi'] = 0$ then
                    $T[\psi'] \leftarrow 1$; $NEW \leftarrow 1$
            foreach formula $\psi_2$ of width at most $w$ with $T[\psi_2] = 1$ or among $\varphi_1, \ldots, \varphi_\ell$ do
                if the cut rule can be applied to $\psi_1$ and $\psi_2$ yielding a $k$-DNF $\psi'$ of width at
                        most $w$ with $T[\psi'] = 0$ then
                    $T[\psi'] \leftarrow 1$; $NEW \leftarrow 1$
        foreach $j$-tuple of distinct $k$-DNFs $(\psi_1, \ldots, \psi_j)$ of width $w$ with $T[\psi_i] = 1$ (for
                $i = 1, \ldots, j$) with $j \leq k$ do
            if $\wedge$-introduction can be applied to $\psi_1, \ldots, \psi_j$ yielding a width-$w$ $k$-DNF $\psi'$
                    with $T[\psi'] = 0$ then
                $T[\psi'] \leftarrow 1$; $NEW \leftarrow 1$
    return Reject

Algorithm 4: Pseudocode for Decide-deg-$d$-PC/PCR
input : Degree bound $d$, list of degree-$d$ polynomials in multilinear monomial representation
        $p_1, \ldots, p_\ell$, target degree-$d$ polynomial in multilinear monomial representation, $q$.
output: Accept if there is a degree-$d$ polynomial calculus (resp. PCR) derivation of $[q = 0]$;
        Reject otherwise.
begin
    Initialize $B$ to the empty list.
    Initialize $S \leftarrow \{p_1, \ldots, p_\ell\}$ ($S$ also contains the complementarity polynomials $x + \bar{x} - 1$
        for PCR).
    while $S \neq \emptyset$ do
        Let $p$ be an arbitrary element of $S$ and remove $p$ from $S$.
        foreach $b \in B$ in decreasing order (while $p \neq 0$) do
            if the leading monomial in $b$ is the leading monomial in $p$ then
                $p \leftarrow$ Gaussian reduction of $p$ by $b$ (i.e., subtract a multiple of $b$ so that the
                    leading monomials cancel).
        if $p \neq 0$ then
            Insert $p$ into $B$, maintaining the decreasing order of lead monomials.
            if $p$ has degree at most $d - 1$ then
                foreach indeterminate $a$ do
                    Add the multilinearization of $a p$ to $S$.
    foreach $b \in B$ in decreasing order (while $q \neq 0$) do
        if the leading monomial in $b$ is the leading monomial in $q$ then
            $q \leftarrow$ Gaussian reduction of $q$ by $b$
    if $q = 0$ then
        return Accept
    return Reject
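
The pseudocode of Algorithm 4 can be made runnable along the following lines. This is a minimal sketch under stated assumptions rather than a definitive implementation: it covers only the PC variant (no complementarity polynomials), works over the rationals via `Fraction`, represents a polynomial as a dict from monomials (frozensets of variable names) to coefficients, and orders monomials by degree and then by variable names. The function names are hypothetical.

```python
from fractions import Fraction

def mono_key(m):
    """Assumed monomial order: by degree, then by sorted variable names."""
    return (len(m), sorted(m))

def lead(p):
    return max(p, key=mono_key)

def reduce_by(p, basis):
    """One pass of Gaussian reduction of p; `basis` is kept sorted by
    decreasing leading monomial, so a single pass suffices."""
    p = dict(p)
    for b in basis:
        if not p:
            break
        lm = lead(p)
        if lead(b) == lm:
            c = Fraction(p[lm]) / b[lm]
            for m, coeff in b.items():
                p[m] = p.get(m, 0) - c * coeff
            p = {m: v for m, v in p.items() if v != 0}
    return p

def times_var(p, x):
    """Multiply p by the variable x and multilinearize (x * x = x)."""
    out = {}
    for m, coeff in p.items():
        m2 = m | {x}
        out[m2] = out.get(m2, 0) + coeff
    return {m: v for m, v in out.items() if v != 0}

def decide_pc(polys, q, variables, d):
    """Return True iff q reduces to 0 modulo the degree-d closure of polys."""
    basis, todo = [], [dict(p) for p in polys]
    while todo:
        p = reduce_by(todo.pop(), basis)
        if p:
            basis.append(p)
            basis.sort(key=lambda b: mono_key(lead(b)), reverse=True)
            if max(len(m) for m in p) <= d - 1:
                todo.extend(times_var(p, x) for x in variables)
    return not reduce_by(q, basis)

# Usage: from xy - 1 = 0 and x = 0, degree 2 suffices to derive 1 = 0.
one, x, xy = frozenset(), frozenset({"x"}), frozenset({"x", "y"})
print(decide_pc([{xy: 1, one: -1}, {x: 1}], {one: 1}, ["x", "y"], 2))
```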

Algorithm 5: DecideSparseBoundedCP
input : Formulas $\varphi_1, \ldots, \varphi_\ell$ and $\varphi$, sparsity and $\ell_1$-norm bounds $w, L \in \mathbb{N}$.
output: Accept if there is an $L$-bounded $w$-sparse proof of $\varphi$ from $\varphi_1, \ldots, \varphi_\ell$; else, Reject.
begin
    Initialize a table $T[\psi] \leftarrow 0$ for every cutting planes formula $\psi$ of sparsity $w$ and
        $\|\psi\|_1 \leq L$; put $T[\psi] \leftarrow 1$ for every axiom $\psi$.
    if $\varphi$ is an axiom then
        return Accept
    for $i = 1, \ldots, \ell$, if $\varphi_i$ is $w$-sparse do
        if $\varphi_i = \varphi$ then
            return Accept
    $NEW \leftarrow 1$.
    while $NEW = 1$ do
        $NEW \leftarrow 0$.
        foreach pair of formulas $(\psi_1, \psi_2)$ in $T$ or among $\varphi_1, \ldots, \varphi_\ell$ do
            if $\psi_1 + \psi_2$ has sparsity at most $w$, $\|\psi_1 + \psi_2\|_1 \leq L$, and $T[\psi_1 + \psi_2] = 0$ then
                $NEW \leftarrow 1$; $T[\psi_1 + \psi_2] \leftarrow 1$
        foreach formula $\psi$ in $T$ do
            for $a = -L, \ldots, L$ do
                if $\|a \cdot \psi\|_1 \leq L$ and $T[a \cdot \psi] = 0$ then
                    if $a \cdot \psi = \varphi$ then
                        return Accept
                    $NEW \leftarrow 1$; $T[a \cdot \psi] \leftarrow 1$
            for $d = 2, \ldots, L$ do
                if $d$ divides $\psi$ and $T[\psi \text{ divided by } d] = 0$ then
                    if $\psi$ divided by $d$ $= \varphi$ then
                        return Accept
                    $NEW \leftarrow 1$; $T[\psi \text{ divided by } d] \leftarrow 1$
    return Reject